Knights Corner Instruction Set Reference 
Manual 


June 6, 2012 


Reference Number: 327364-001 




INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY
THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, 
INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, 
RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO 
FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT 
OR OTHER INTELLECTUAL PROPERTY RIGHT. 


A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or 
indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH 
MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS
AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT 
OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING 
IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS
SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS
PARTS. 


Intel may make changes to specifications and product descriptions at any time, without notice. Designers must 
not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel 
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities 
arising from future changes to them. The information here is subject to change without notice. Do not finalize a 
design with this information. 


The products described in this document may contain design defects or errors known as errata which may cause 
the product to deviate from published specifications. Current characterized errata are available on request. 


Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your 
product order. 


Copies of documents which have an order number and are referenced in this document, or other Intel literature, 
may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm 


Intel, the Intel® logo, Intel® Pentium®, Intel® Xeon®, Intel® Pentium® 4 Processor, Intel® Core™ Duo, Intel®
Core™ 2 Duo, MMX™, Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® 
AVX) are trademarks or registered trademarks of Intel® Corporation or its subsidiaries in the United States and 
other countries. *Other names and brands may be claimed as the property of others. 


Copyright 2012 Intel® Corporation. All rights reserved. 



Contents 
1 Introduction
2 Instructions Terminology and State
2.1 Overview of the Knights Corner instructions Extensions
2.1.1 What are vectors?
2.1.2 Vector mask registers
2.1.2.1 Vector mask k0
2.1.2.2 Example of use
2.1.3 Understanding Knights Corner instructions
2.1.3.1 Knights Corner instructions Vector Instructions
2.1.3.2 Knights Corner instructions Vector Memory Instructions
2.1.3.3 Knights Corner instructions vector mask Instructions
2.1.3.4 Knights Corner instructions New Scalar Instructions
2.2 Knights Corner instructions Swizzles and Converts
2.2.1 Load-Op Swizzle/Convert
2.2.2 Load Up-convert
2.2.3 Down-Conversion
2.3 Static Rounding Mode
2.4 Knights Corner Execution Environments
3 Knights Corner Instruction Format
3.1 Overview
3.2 Instruction Formats


3.2.1 MVEX/VEX and the LOCK prefix
3.2.2 MVEX/VEX and the 66H, F2H, and F3H prefixes
3.2.3 MVEX/VEX and the REX prefix
3.3 The MVEX Prefix
3.3.1 Vector SIB (VSIB) Memory Addressing
3.4 The VEX Prefix
3.5 Knights Corner instructions Assembly Syntax
3.6 Notation
3.6.1 Operand Notation
3.6.2 The Displacement Bytes
3.6.3 Memory size and disp8*N calculation
3.7 EH hint
3.8 Functions and Tables Used
3.8.1 MemLoad and MemStore
3.8.2 SwizzUpConvLoad, UpConvLoad and DownConvStore
3.8.3 Other Functions/Identifiers
4 Floating-Point Environment, Memory Addressing, and Processor State
4.1 Overview
4.1.1 Suppress All Exceptions Attribute (SAE)
4.1.2 SIMD Floating-Point Exceptions
4.1.3 SIMD Floating-Point Exception Conditions
4.1.3.1 Invalid Operation Exception (#I)
4.1.3.2 Divide-By-Zero Exception (#Z)
4.1.3.3 Denormal Operand Exception (#D)
4.1.3.4 Numeric Overflow Exception (#O)
4.1.3.5 Numeric Underflow Exception (#U)
4.1.3.6 Inexact Result (Precision) Exception (#P)
4.2 Denormal Flushing Control
4.2.1 Denormal control in up-conversions and down-conversions




4.2.1.1 Up-conversions
4.2.1.2 Down-conversions
4.3 Extended Addressing Displacements
4.4 Swizzle/up-conversion exceptions
4.5 Accessing uncacheable memory
4.5.1 Memory read operations
4.5.2 vloadunpackh*/vloadunpackl*
4.5.3 vgatherd*
4.5.4 Memory stores
4.6 Floating-point Notes
4.6.1 Rounding Modes
4.6.1.1 Swizzle-explicit rounding modes
4.6.1.2 Definition and propagation of NaNs
4.6.1.3 Signed Zeros
4.6.2 REX prefix and Knights Corner instructions interactions
4.7 Knights Corner instructions State Save
4.8 Knights Corner instructions Processor State After Reset
5 Instruction Set Reference
5.1 Interpreting Instruction Reference Pages
5.1.1 Instruction Format
5.1.2 Opcode Notations for MVEX Encoded Instructions
5.1.3 Opcode Notations for VEX Encoded Instructions
6 Instruction Descriptions
6.1 Vector Mask Instructions
JKNZD -- Jump near if mask is not zero
JKZD -- Jump near if mask is zero
KAND -- AND Vector Mask


KANDNR -- Reverse AND NOT Vector Mask
KCONCATH -- Pack and Move High Vector Mask
KCONCATL -- Pack and Move Low Vector Mask
KEXTRACT -- Extract Vector Mask From Register
KMERGE2L1H -- Swap and Merge High Element Portion and Low Portion of Vector Masks
KMERGE2L1L -- Move Low Element Portion into High Portion of Vector Mask
KMOV -- Move Vector Mask
KNOT -- Not Vector Mask
KOR -- OR Vector Masks
KORTEST -- OR Vector Mask And Set EFLAGS
KXNOR -- XNOR Vector Masks
KXOR -- XOR Vector Masks
6.2 Vector Instructions
VADDNPD -- Add and Negate Float64 Vectors
VADDNPS -- Add and Negate Float32 Vectors
VADDPD -- Add Float64 Vectors
VADDPS -- Add Float32 Vectors
VADDSETSPS -- Add Float32 Vectors and Set Mask to Sign
VALIGND -- Align Doubleword Vectors
VBLENDMPD -- Blend Float64 Vectors using the Instruction Mask
VBLENDMPS -- Blend Float32 Vectors using the Instruction Mask
VBROADCASTF32X4 -- Broadcast 4xFloat32 Vector
VBROADCASTF64X4 -- Broadcast 4xFloat64 Vector
VBROADCASTI32X4 -- Broadcast 4xInt32 Vector
VBROADCASTI64X4 -- Broadcast 4xInt64 Vector
VBROADCASTSD -- Broadcast Float64 Vector
VBROADCASTSS -- Broadcast Float32 Vector
VCMPPD -- Compare Float64 Vectors and Set Vector Mask
VCMPPS -- Compare Float32 Vectors and Set Vector Mask


VCVTDQ2PD -- Convert Int32 Vector to Float64 Vector
VCVTFXPNTDQ2PS -- Convert Fixed Point Int32 Vector to Float32 Vector
VCVTFXPNTPD2DQ -- Convert Float64 Vector to Fixed Point Int32 Vector
VCVTFXPNTPD2UDQ -- Convert Float64 Vector to Fixed Point Uint32 Vector
VCVTFXPNTPS2DQ -- Convert Float32 Vector to Fixed Point Int32 Vector
VCVTFXPNTPS2UDQ -- Convert Float32 Vector to Fixed Point Uint32 Vector
VCVTFXPNTUDQ2PS -- Convert Fixed Point Uint32 Vector to Float32 Vector
VCVTPD2PS -- Convert Float64 Vector to Float32 Vector
VCVTPS2PD -- Convert Float32 Vector to Float64 Vector
VCVTUDQ2PD -- Convert Uint32 Vector to Float64 Vector
VEXP223PS -- Base-2 Exponential Calculation of Float32 Vector
VFIXUPNANPD -- Fix Up Special Float64 Vector Numbers With NaN Passthrough
VFIXUPNANPS -- Fix Up Special Float32 Vector Numbers With NaN Passthrough
VFMADD132PD -- Multiply Destination By Second Source and Add To First Source Float64 Vectors
VFMADD132PS -- Multiply Destination By Second Source and Add To First Source Float32 Vectors
VFMADD213PD -- Multiply First Source By Destination and Add Second Source Float64 Vectors
VFMADD213PS -- Multiply First Source By Destination and Add Second Source Float32 Vectors
VFMADD231PD -- Multiply First Source By Second Source and Add To Destination Float64 Vectors
VFMADD231PS -- Multiply First Source By Second Source and Add To Destination Float32 Vectors
VFMADD233PS -- Multiply First Source By Specially Swizzled Second Source and Add To Second Source Float32 Vectors
VFMSUB132PD -- Multiply Destination By Second Source and Subtract First Source Float64 Vectors
VFMSUB132PS -- Multiply Destination By Second Source and Subtract First Source Float32 Vectors
VFMSUB213PD -- Multiply First Source By Destination and Subtract Second Source Float64 Vectors
VFMSUB213PS -- Multiply First Source By Destination and Subtract Second Source Float32 Vectors
VFMSUB231PD -- Multiply First Source By Second Source and Subtract Destination Float64 Vectors
VFMSUB231PS -- Multiply First Source By Second Source and Subtract Destination Float32 Vectors
VFNMADD132PD -- Multiply Destination By Second Source and Subtract From First Source Float64 Vectors
VFNMADD132PS -- Multiply Destination By Second Source and Subtract From First Source Float32 Vectors




VFNMADD213PD -- Multiply First Source By Destination and Subtract From Second Source Float64 Vectors
VFNMADD213PS -- Multiply First Source By Destination and Subtract From Second Source Float32 Vectors
VFNMADD231PD -- Multiply First Source By Second Source and Subtract From Destination Float64 Vectors
VFNMADD231PS -- Multiply First Source By Second Source and Subtract From Destination Float32 Vectors
VFNMSUB132PD -- Multiply Destination By Second Source, Negate, and Subtract First Source Float64 Vectors
VFNMSUB132PS -- Multiply Destination By Second Source, Negate, and Subtract First Source Float32 Vectors
VFNMSUB213PD -- Multiply First Source By Destination, Negate, and Subtract Second Source Float64 Vectors
VFNMSUB213PS -- Multiply First Source By Destination, Negate, and Subtract Second Source Float32 Vectors
VFNMSUB231PD -- Multiply First Source By Second Source, Negate, and Subtract Destination Float64 Vectors
VFNMSUB231PS -- Multiply First Source By Second Source, Negate, and Subtract Destination Float32 Vectors
VGATHERDPD -- Gather Float64 Vector With Signed Dword Indices
VGATHERDPS -- Gather Float32 Vector With Signed Dword Indices
VGATHERPF0DPS -- Gather Prefetch Float32 Vector With Signed Dword Indices Into L1
VGATHERPF0HINTDPD -- Gather Prefetch Float64 Vector Hint With Signed Dword Indices
VGATHERPF0HINTDPS -- Gather Prefetch Float32 Vector Hint With Signed Dword Indices
VGATHERPF1DPS -- Gather Prefetch Float32 Vector With Signed Dword Indices Into L2
VGETEXPPD -- Extract Float64 Vector of Exponents from Float64 Vector
VGETEXPPS -- Extract Float32 Vector of Exponents from Float32 Vector
VGETMANTPD -- Extract Float64 Vector of Normalized Mantissas from Float64 Vector
VGETMANTPS -- Extract Float32 Vector of Normalized Mantissas from Float32 Vector
VGMAXABSPS -- Absolute Maximum of Float32 Vectors
VGMAXPD -- Maximum of Float64 Vectors
VGMAXPS -- Maximum of Float32 Vectors
VGMINPD -- Minimum of Float64 Vectors


VGMINPS -- Minimum of Float32 Vectors
VLOADUNPACKHD -- Load Unaligned High And Unpack To Doubleword Vector
VLOADUNPACKHPD -- Load Unaligned High And Unpack To Float64 Vector
VLOADUNPACKHPS -- Load Unaligned High And Unpack To Float32 Vector
VLOADUNPACKHQ -- Load Unaligned High And Unpack To Int64 Vector
VLOADUNPACKLD -- Load Unaligned Low And Unpack To Doubleword Vector
VLOADUNPACKLPD -- Load Unaligned Low And Unpack To Float64 Vector
VLOADUNPACKLPS -- Load Unaligned Low And Unpack To Float32 Vector
VLOADUNPACKLQ -- Load Unaligned Low And Unpack To Int64 Vector
VLOG2PS -- Vector Logarithm Base-2 of Float32 Vector
VMOVAPD -- Move Aligned Float64 Vector
VMOVAPS -- Move Aligned Float32 Vector
VMOVDQA32 -- Move Aligned Int32 Vector
VMOVDQA64 -- Move Aligned Int64 Vector
VMOVNRAPD -- Store Aligned Float64 Vector With No-Read Hint
VMOVNRAPS -- Store Aligned Float32 Vector With No-Read Hint
VMOVNRNGOAPD -- Non-globally Ordered Store Aligned Float64 Vector With No-Read Hint
VMOVNRNGOAPS -- Non-globally Ordered Store Aligned Float32 Vector With No-Read Hint
VMULPD -- Multiply Float64 Vectors
VMULPS -- Multiply Float32 Vectors
VPACKSTOREHD -- Pack And Store Unaligned High From Int32 Vector
VPACKSTOREHPD -- Pack And Store Unaligned High From Float64 Vector
VPACKSTOREHPS -- Pack And Store Unaligned High From Float32 Vector
VPACKSTOREHQ -- Pack And Store Unaligned High From Int64 Vector
VPACKSTORELD -- Pack and Store Unaligned Low From Int32 Vector
VPACKSTORELPD -- Pack and Store Unaligned Low From Float64 Vector
VPACKSTORELPS -- Pack and Store Unaligned Low From Float32 Vector
VPACKSTORELQ -- Pack and Store Unaligned Low From Int64 Vector
VPADCD -- Add Int32 Vectors with Carry



VPADDD -- Add Int32 Vectors
VPADDSETCD -- Add Int32 Vectors and Set Mask to Carry
VPADDSETSD -- Add Int32 Vectors and Set Mask to Sign
VPANDD -- Bitwise AND Int32 Vectors
VPANDND -- Bitwise AND NOT Int32 Vectors
VPANDNQ -- Bitwise AND NOT Int64 Vectors
VPANDQ -- Bitwise AND Int64 Vectors
VPBLENDMD -- Blend Int32 Vectors using the Instruction Mask
VPBLENDMQ -- Blend Int64 Vectors using the Instruction Mask
VPBROADCASTD -- Broadcast Int32 Vector
VPBROADCASTQ -- Broadcast Int64 Vector
VPCMPD -- Compare Int32 Vectors and Set Vector Mask
VPCMPEQD -- Compare Equal Int32 Vectors and Set Vector Mask
VPCMPGTD -- Compare Greater Than Int32 Vectors and Set Vector Mask
VPCMPLTD -- Compare Less Than Int32 Vectors and Set Vector Mask
VPCMPUD -- Compare Uint32 Vectors and Set Vector Mask
VPERMD -- Permutes Int32 Vectors
VPERMF32X4 -- Shuffle Vector Dqwords
VPGATHERDD -- Gather Int32 Vector With Signed Dword Indices
VPGATHERDQ -- Gather Int64 Vector With Signed Dword Indices
VPMADD231D -- Multiply First Source By Second Source and Add To Destination Int32 Vectors
VPMADD233D -- Multiply First Source By Specially Swizzled Second Source and Add To Second Source Int32 Vectors
VPMAXSD -- Maximum of Int32 Vectors
VPMAXUD -- Maximum of Uint32 Vectors
VPMINSD -- Minimum of Int32 Vectors
VPMINUD -- Minimum of Uint32 Vectors
VPMULHD -- Multiply Int32 Vectors And Store High Result
VPMULHUD -- Multiply Uint32 Vectors And Store High Result
VPMULLD -- Multiply Int32 Vectors And Store Low Result


VPORD -- Bitwise OR Int32 Vectors
VPORQ -- Bitwise OR Int64 Vectors
VPSBBD -- Subtract Int32 Vectors with Borrow
VPSBBRD -- Reverse Subtract Int32 Vectors with Borrow
VPSCATTERDD -- Scatter Int32 Vector With Signed Dword Indices
VPSCATTERDQ -- Scatter Int64 Vector With Signed Dword Indices
VPSHUFD -- Shuffle Vector Doublewords
VPSLLD -- Shift Int32 Vector Immediate Left Logical
VPSLLVD -- Shift Int32 Vector Left Logical
VPSRAD -- Shift Int32 Vector Immediate Right Arithmetic
VPSRAVD -- Shift Int32 Vector Right Arithmetic
VPSRLD -- Shift Int32 Vector Immediate Right Logical
VPSRLVD -- Shift Int32 Vector Right Logical
VPSUBD -- Subtract Int32 Vectors
VPSUBRD -- Reverse Subtract Int32 Vectors
VPSUBRSETBD -- Reverse Subtract Int32 Vectors and Set Borrow
VPSUBSETBD -- Subtract Int32 Vectors and Set Borrow
VPTESTMD -- Logical AND Int32 Vectors and Set Vector Mask
VPXORD -- Bitwise XOR Int32 Vectors
VPXORQ -- Bitwise XOR Int64 Vectors
VRCP23PS -- Reciprocal of Float32 Vector
VRNDFXPNTPD -- Round Float64 Vector
VRNDFXPNTPS -- Round Float32 Vector
VRSQRT23PS -- Vector Reciprocal Square Root of Float32 Vector
VSCALEPS -- Scale Float32 Vectors
VSCATTERDPD -- Scatter Float64 Vector With Signed Dword Indices
VSCATTERDPS -- Scatter Float32 Vector With Signed Dword Indices
VSCATTERPF0DPS -- Scatter Prefetch Float32 Vector With Signed Dword Indices Into L1
VSCATTERPF0HINTDPD -- Scatter Prefetch Float64 Vector Hint With Signed Dword Indices




VSCATTERPF0HINTDPS -- Scatter Prefetch Float32 Vector Hint With Signed Dword Indices
VSCATTERPF1DPS -- Scatter Prefetch Float32 Vector With Signed Dword Indices Into L2
VSUBPD -- Subtract Float64 Vectors
A Scalar Instruction Descriptions
CLEVICT0 -- Evict L1 line
CLEVICT1 -- Evict L2 line
DELAY -- Stall Thread
LZCNT -- Leading Zero Count
POPCNT -- Return the Count of Number of Bits Set to 1
SPFLT -- Set performance monitor filtering mask
TZCNT -- Trailing Zero Count
TZCNTI -- Initialized Trailing Zero Count
VPREFETCH0 -- Prefetch memory line using T0 hint
VPREFETCH1 -- Prefetch memory line using T1 hint
VPREFETCH2 -- Prefetch memory line using T2 hint
VPREFETCHE0 -- Prefetch memory line using T0 hint, with intent to write
VPREFETCHE1 -- Prefetch memory line using T1 hint, with intent to write
VPREFETCHE2 -- Prefetch memory line using T2 hint, with intent to write
VPREFETCHENTA -- Prefetch memory line using NTA hint, with intent to write
VPREFETCHNTA -- Prefetch memory line using NTA hint
B Knights Corner 64 bit Mode Scalar Instruction Support
B.1 64 bit Mode General-Purpose and X87 Instructions
B.2 Knights Corner 64 bit Mode Limitations
B.3 LDMXCSR -- Load MXCSR Register
B.4 FXRSTOR -- Restore x87 FPU and MXCSR State


B.5 FXSAVE -- Save x87 FPU and MXCSR State
B.6 RDPMC -- Read Performance-Monitoring Counters
B.7 STMXCSR -- Store MXCSR Register
B.8 CPUID -- CPUID Identification
C Floating-Point Exception Summary
C.1 Instruction floating-point exception summary
C.2 Conversion floating-point exception summary
C.3 Denormal behavior
D Instruction Attributes and Categories
D.1 Conversion Instruction Families
D.1.1 Df32 Family of Instructions
D.1.2 Df64 Family of Instructions
D.1.3 Di32 Family of Instructions
D.1.4 Di64 Family of Instructions
D.1.5 Sf32 Family of Instructions
D.1.6 Sf64 Family of Instructions
D.1.7 Si32 Family of Instructions
D.1.8 Si64 Family of Instructions
D.1.9 Uf32 Family of Instructions
D.1.10 Uf64 Family of Instructions
D.1.11 Ui32 Family of Instructions
D.1.12 Ui64 Family of Instructions
E Non-faulting Undefined Opcodes
F General Templates
F.1 Mask Operation Templates
Mask m0 -- Template
Mask m1 -- Template


Mask m2 -- Template
Mask m3 -- Template
Mask m4 -- Template
Mask m5 -- Template
F.2 Vector Operation Templates
Vector v0 -- Template
Vector v1 -- Template
Vector v10 -- Template
Vector v11 -- Template
Vector v2 -- Template
Vector v3 -- Template
Vector v4 -- Template
Vector v5 -- Template
Vector v6 -- Template
Vector v7 -- Template
Vector v8 -- Template
Vector v9 -- Template
F.3 Scalar Operation Templates
Scalar s0 -- Template
Scalar s1 -- Template




List of Tables 


2.1 


2.2 


2.3 


2.4 


2.5 


EH attribute syntax


32 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 32 bit elements
that form one 128-bit block in the source (with 'a' least significant and 'd' most significant), so
aaaa means that the least significant element of the 128-bit block in the source is replicated to all
four elements of the same 128-bit block in the destination; the depicted pattern is then repeated
for all four 128-bit blocks in the source and destination. We use 'ponm lkji hgfe dcba' to denote a
full Knights Corner instructions source register, where 'a' is the least significant element and 'p' is
the most significant element. However, since each 128-bit block performs the same permutation
for register swizzles, we only show the least significant block here. Note that in this table as well
as in subsequent ones from this chapter S2S1S0 are bits 6-4 from MVEX prefix encoding (see
Figure 3.3).


64 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 64 bit elements
that form one 256-bit block in the source (with 'a' least significant and 'd' most significant), so
aaaa means that the least significant element of the 256-bit block in the source is replicated to
all four elements of the same 256-bit block in the destination; the depicted pattern is then
repeated for the two 256-bit blocks in the source and destination. We use 'hgfe dcba' to denote a
full Knights Corner instructions source register, where 'a' is the least significant element and 'h' is
the most significant element. However, since each 256-bit block performs the same permutation


Chapter 1


Introduction 


This document describes new vector instructions for the co-processor code-named Knights Corner. 


The major features of the new vector instructions described herein are: 


A high performance 64 bit execution environment: Knights Corner provides a 64 bit execution environment
(see Figure 2.1) similar to that described in the Intel® 64 Architecture Software Developer's Manual.
Additionally, the Knights Corner instructions provide basic support for float64 and int64 logical operations.


32 new vector registers: Knights Corner's 64 bit environment offers 32 vector SIMD registers, each 512 bits
wide, tailored to boost the performance of high performance computing applications. The 512-bit vector SIMD
instruction extensions provide comprehensive, native support to handle 32 bit and 64 bit floating-point
and integer data, including a rich set of conversions for native data types.


Ternary instructions: Most instructions are ternary, with two sources and a different destination.
Multiply-add instructions are ternary with three sources, one of which is also the destination.


Vector mask support: The Knights Corner instructions introduce 8 vector mask registers that allow for
conditional execution over the 16 (or 8) elements in a vector instruction, and merging of the results into the
destination. Masks allow vectorizing loops that contain conditional statements. Additionally, the Knights
Corner instructions provide support for updating the value of the vector masks with special vector
instructions such as vcmpps.


Coherent memory model: The Knights Corner instructions operate in a memory address space that follows
the standard defined by the Intel® 64 architecture. This feature eases the process of developing vector code.


Gather/Scatter support: The Knights Corner instructions feature specific gather/scatter instructions that
allow manipulation of irregular data patterns in memory (by fetching sparse locations of memory into a
dense vector register or vice-versa), thus enabling vectorization of algorithms with complex data
structures.
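

As a rough illustration of the gather/scatter idea in the last item above, the following Python sketch models memory as a flat list and a vector register as a 16-element list. The names `gather` and `scatter`, the `VLEN` constant, and the list-based memory model are assumptions made for illustration only; they do not reproduce the instruction semantics defined later in this manual.

```python
# Illustrative model only: memory is a flat Python list and a vector
# register is a list of 16 elements. All names here are hypothetical.
VLEN = 16

def gather(memory, indices):
    """Fetch sparse memory locations into a dense 16-element vector."""
    if len(indices) != VLEN:
        raise ValueError("expected one index per vector element")
    return [memory[i] for i in indices]

def scatter(memory, indices, vector):
    """Write a dense 16-element vector back to sparse memory locations."""
    if len(indices) != VLEN or len(vector) != VLEN:
        raise ValueError("expected one index and one value per element")
    for i, value in zip(indices, vector):
        memory[i] = value

memory = list(range(100, 200))          # 100 consecutive "memory" elements
indices = [3 * n for n in range(VLEN)]  # strided, non-contiguous locations
vec = gather(memory, indices)           # dense copy of the sparse elements
scatter(memory, indices, [v + 1 for v in vec])  # update them in place
```

Only the indexed locations change; every other element of `memory` is left as it was.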
Chapter 2


Instructions Terminology and State 


The vector streaming SIMD instruction extensions are designed to enhance the performance of Intel® 64
processors for scientific and engineering applications.


This chapter introduces Knights Corner instructions terminology and relevant processor state. 


2.1 Overview of the Knights Corner instructions Extensions 


2.1.1 What are vectors?


The vector is the basic working unit of the Knights Corner instructions. Most instructions use at least one
vector. A vector is defined as a sequence of packed data elements. For Knights Corner instructions the size of a
vector is 64 bytes. Since the supported data types are float32, int32, float64 and int64, a vector consists of
either 16 doubleword-size elements or, alternatively, 8 quadword-size elements. Only doubleword and quadword
elements are supported in Knights Corner instructions.
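

The two views of a 64-byte vector can be illustrated with a short Python sketch. It is only a software model of the layout described above, not part of the instruction set; the helper name `unpack_vector` and the little-endian byte order used for the illustration are assumptions.

```python
import struct

VECTOR_BYTES = 64  # size of one vector, as stated in the text


def unpack_vector(raw: bytes, fmt: str) -> list:
    """Split a 64-byte vector into elements.

    fmt is a struct type code: 'i' (int32), 'f' (float32),
    'q' (int64) or 'd' (float64). Hypothetical helper for illustration.
    """
    if len(raw) != VECTOR_BYTES:
        raise ValueError("a vector is exactly 64 bytes")
    count = VECTOR_BYTES // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{count}{fmt}", raw))


raw = bytes(range(64))
as_int32 = unpack_vector(raw, "i")  # 16 doubleword elements
as_int64 = unpack_vector(raw, "q")  # 8 quadword elements
```

The same 64 bytes yield 16 elements when read as doublewords and 8 elements when read as quadwords.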


The number of Knights Corner instructions registers is 32. 


Additionally, Knights Corner instructions features vector masks. Vector masks allow any set of elements in the 
destination to be protected from updates during the execution of any operation. A subset of this functionality 
is the ability to control the vector length of the operation being performed (that is, the span of elements being 
modified, from the first to the last one); however, it is not necessary that the elements that are modified be 
consecutive. 


2.1.2 Vector mask registers 


Most Knights Corner instructions vector instructions use a special extra source, known as the write-mask, 
sourced from a set of 8 registers called vector mask registers. These registers contain one bit for each element 
that can be held by a regular Knights Corner instructions vector register. 


Elements are always either float32, int32, float64 or int64 and the vector size is set to 64 bytes. Therefore, a
vector register holds either 8 or 16 elements; accordingly, the length of a vector mask register is 16 bits. For 64
bit datatype instructions, only the 8 least significant bits of the vector mask register are used.


A vector mask register affects an instruction for which it is the write-mask operand at element granularity (either 
32 or 64 bits). That means that every element-sized operation and element-sized destination update by a vector 
instruction is predicated on the corresponding bit of the vector mask register used as the write-mask operand. 
That has two implications: 


• The instruction's operation is not performed for an element if the corresponding write-mask bit is not
set. This implies that no exception or violation can be caused by an operation on a masked-off element. 


• A destination element is not updated if the corresponding write-mask bit is not set. Thus, the mask in
effect provides a merging behavior for Knights Corner instructions vector register destinations, thereby 
potentially converting destinations into implicit sources, whenever a write-mask containing any 0-bits is 
used. 


This merging behavior, and the associated performance hazards, can also occur when writing a vector to 
memory via a vector store. Vectors are written on a per element basis, based on the vector mask regis- 
ter used as a write-mask. Therefore, no exception or violation can be caused by a write to a masked-off 
element of a destination vector operand. 


The sticky bits implemented in the MXCSR to indicate that floating-point exceptions occurred are set based
solely upon operations on non-masked vector elements.


The value of a given mask register can be set up as a direct result of a vector comparison instruction, transferred 
from a GP register, or calculated as a direct result of a logical operation between two masks. 


Vector mask registers can be used for purposes other than write-masking. For example, they can be used to
set the EFLAGS based on the 0/0xFFFF/other status of the OR of two vector mask registers. A number of the
Knights Corner instructions are provided to support such uses of the vector mask register.


2.1.2.1 Vector mask k0 


The only exception to the vector mask rules described above is mask k0. Mask k0 cannot be selected as the write- 
mask for a vector operation; the encoding that would be expected to select mask 0 instead selects an implicit 
mask of 0xFFFF, thereby effectively disabling masking. Vector mask k0 can still be used as any non-write-mask
operand for any instruction that takes vector mask operands; it just can't ever be selected as a write-mask. 
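

The k0 rule can be summarized with a small Python model. This is a sketch of the selection rule described above, assuming the eight mask registers are held in a list indexed by register number; the function name `effective_write_mask` is hypothetical.

```python
ALL_LANES = 0xFFFF  # implicit mask selected by the "k0" write-mask encoding


def effective_write_mask(encoded_index: int, k_regs: list) -> int:
    """Return the 16-bit write-mask a vector instruction would use.

    encoded_index is the write-mask field of the instruction (0..7).
    Encoding 0 does not read k0; it selects an all-ones mask instead.
    """
    if not 0 <= encoded_index <= 7:
        raise ValueError("there are only 8 vector mask registers")
    if encoded_index == 0:
        return ALL_LANES
    return k_regs[encoded_index] & ALL_LANES


k_regs = [0x1234, 0, 0, 0x8F03, 0, 0, 0, 0]
unmasked = effective_write_mask(0, k_regs)  # k0 contents are ignored
masked = effective_write_mask(3, k_regs)    # k3 is used as written
```

Note that the contents of k0 remain readable by instructions that take it as an ordinary mask operand; only its use as a write-mask is replaced by the implicit all-ones value.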


2.1.2.2 Example of use 


Here's an example of a masked vector operation. 
The initial state of vector registers zmm0, zmm1, and zmm2 and of vector mask register k3 is:


           MSB                                     LSB
zmm0 = [ 0x00000003 0x00000002 0x00000001 0x00000000 ] (bytes 15 through 0)
       [ 0x00000007 0x00000006 0x00000005 0x00000004 ] (bytes 31 through 16)
       [ 0x0000000B 0x0000000A 0x00000009 0x00000008 ] (bytes 47 through 32)
       [ 0x0000000F 0x0000000E 0x0000000D 0x0000000C ] (bytes 63 through 48)


zmm1 = [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ] (bytes 15 through 0)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ] (bytes 31 through 16)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ] (bytes 47 through 32)
       [ 0x0000000F 0x0000000F 0x0000000F 0x0000000F ] (bytes 63 through 48)


zmm2 = [ 0xAAAAAAAA 0xAAAAAAAA 0xAAAAAAAA 0xAAAAAAAA ] (bytes 15 through 0)
       [ 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB ] (bytes 31 through 16)
       [ 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC 0xCCCCCCCC ] (bytes 47 through 32)
       [ 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD ] (bytes 63 through 48)


k3 = 0x8F03 (1000 1111 0000 0011)


Given this state, we will execute the following instruction:


vpaddd zmm2 {k3}, zmm0, zmm1


The vpaddd instruction adds vector elements of 32 bit integers. Since elements are not operated upon when the 
corresponding bit of the mask is not set, the temporary result would be: 


[ ********** ********** 0x00000010 0x0000000F ] (bytes 15 through 0)
[ ********** ********** ********** ********** ] (bytes 31 through 16)
[ 0x0000001A 0x00000019 0x00000018 0x00000017 ] (bytes 47 through 32)
[ 0x0000001E ********** ********** ********** ] (bytes 63 through 48)


where "**********" indicates that no operation is performed.


This temporary result is then written into the destination vector register, zmm2, using vector mask register k3 
as the write-mask, producing the following final result: 


zmm2 = [ 0xAAAAAAAA 0xAAAAAAAA 0x00000010 0x0000000F ] (bytes 15 through 0)
       [ 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB 0xBBBBBBBB ] (bytes 31 through 16)
       [ 0x0000001A 0x00000019 0x00000018 0x00000017 ] (bytes 47 through 32)
       [ 0x0000001E 0xDDDDDDDD 0xDDDDDDDD 0xDDDDDDDD ] (bytes 63 through 48)


Note that for a 64 bit instruction (say vaddpd), only the 8 LSB of mask k3 (0x03) would be used to identify the 
write-mask operation on each one of the 8 elements of the source/destination vectors. 
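

The masked addition in this example can be replayed with a short Python model. It is a sketch of the write-mask semantics only, not of the hardware; registers are modeled as lists of sixteen 32-bit values with element 0 as the least significant element, and the function name `masked_vpaddd` is hypothetical.

```python
def masked_vpaddd(dst, src1, src2, mask):
    """Model of a write-masked 32-bit integer vector add.

    Lanes whose mask bit is 0 are neither computed nor written,
    so the old destination value is kept (merging behavior).
    """
    out = list(dst)
    for lane in range(16):
        if (mask >> lane) & 1:
            out[lane] = (src1[lane] + src2[lane]) & 0xFFFFFFFF
    return out


zmm0 = list(range(16))                       # elements 0x0 .. 0xF
zmm1 = [0x0000000F] * 16
zmm2 = ([0xAAAAAAAA] * 4 + [0xBBBBBBBB] * 4 +
        [0xCCCCCCCC] * 4 + [0xDDDDDDDD] * 4)
k3 = 0x8F03                                  # lanes 0, 1, 8..11 and 15 enabled

result = masked_vpaddd(zmm2, zmm0, zmm1, k3)
```

Only lanes 0, 1, 8 through 11, and 15 receive sums; all other lanes of `result` still hold the original contents of zmm2.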


2.1.3. Understanding Knights Corner instructions 


Knights Corner instructions can be classified depending on the nature of their operands. The majority of the 
Knights Corner instructions operate on vector registers, with a vector mask register serving as a write-mask. 
However, in most cases these instructions may have one of the vector source operands stored in either memory 
or a vector register, and may additionally have one or more non-vector (scalar) operands, such as an Intel® 64
general purpose register or an immediate value. Additionally, some instructions use vector mask registers as 
destinations and/or explicit sources. Finally, Knights Corner instructions adds some new scalar instructions. 


From the point of view of instruction formats, there are four main types of Knights Corner instructions:


• Vector Instructions
• Vector Memory Instructions
• Vector Mask Instructions
• New Scalar Instructions


2.1.3.1 Knights Corner instructions Vector Instructions 


Vector instructions operate on vectors that are sourced from either registers or memory and that can be modified 
prior to the operation via predefined swizzle and convert functions. The destination is usually a vector register, 
though some vector instructions may have a vector mask register as either a second destination or the primary 
destination. 


All these instructions work in an element-wise manner: the first element of the first source vector is operated 
on together with the first element of the second source vector, and the result is stored in the first element of the 
destination vector, and so on for the remaining 15 (or 7) elements. 


As described above, the vector mask register that serves as the write-mask for a vector instruction determines
which element locations are actually operated upon; the mask can disable the operation and update for any 
combination of element locations. 


Most vector instructions have three different vector operands (typically, two sources and one destination) ex- 
cept those instructions that have a single source and thus use only two operands. Additionally, most vector 
instructions feature an extra operand in the form of the vector mask register that serves as the write-mask.
Thus, we can categorize Knights Corner instructions vector instructions based on the number of vector sources 
they use: 


Vector-Converted Vector/Memory. Vector-converted vector/memory instructions, such as vaddps (which 
adds two vectors), are ternary operations that take two different sources, a vector register and a converted 
vector/memory operand, and a separate destination vector register, as follows: 


zmm0 <= OP(zmm1, S(zmm2, m))


where zmm1 is a vector operand that is used as the first source for the instruction, S(zmm2, m) is a con- 
verted vector/memory operand that is used as the second source for the instruction, and the result of 
performing operation OP on the two source operands is written to vector destination register zmm0. 


A converted vector/memory operand is a source vector operand that is obtained by applying a
swizzle/conversion function to either a Knights Corner instructions vector or a memory operand.
The details of the swizzle/conversion function are found in section 2.2; note that its behavior varies
depending on whether the operand is a register or a memory location, and, for memory operands, on whether
the instruction performs a floating-point or integer operation. Each source memory operand must have
an address that is aligned to the number of bytes of memory actually accessed by the operand (that is,
before the swizzle/convert is performed); otherwise, a #GP fault will result.


Converted Vector/Memory. Converted vector/memory instructions, such as vcvtpu2ps (which converts a vec- 
tor of unsigned integers to a vector of floats), are binary operations that take a single vector source, as 
follows: 


zmm0 <= OP(S(zmm1, m))


Vector-Vector-Converted Vector/Memory. Vector-vector-converted vector/memory instructions, of which 
vfmaddps (multiply-add of three vectors) is a good example, are similar to the vector-converted vec- 
tor/memory family of instructions; here, however, the destination vector register is used as a third source 
as well: 


zmm0 <= OP(zmm0, zmm1, S(zmm2, m))


2.1.3.2 Knights Corner instructions Vector Memory Instructions: 


Vector Memory Instructions perform vector loads from and vector stores to memory, with extended conversion 
support. 


As with regular vector instructions, vector memory instructions transfer data from/to memory in an element- 
wise fashion, with the elements that are actually transferred dictated by the contents of the vector mask that is 
selected as the write-mask. 


There are two basic groups of Knights Corner instructions vector memory instructions, vector loads/broadcasts 
and vector stores. 


Vector Loads/Broadcasts. A vector load/broadcast reads a memory source, performs a predefined load con- 
version function, and replicates the result (in the case of broadcasts) to form a 64-byte 16-element vector 
(or 8-element for 64 bit datatypes). This vector is then conditionally written element-wise to the vector 
destination register, with the writes enabled or disabled according to the corresponding bits of the vector 
mask register selected as the write-mask. 


The size of the memory operand is a function of the type of conversion and the number of replications 
to be performed on the memory operand. We call this special memory operand an up-converted memory 
operand. Each source memory operand must have an address that is aligned to the number of bytes of 
memory actually accessed by the operand (that is, before the swizzle/convert is performed); otherwise, a 
#GP fault will result. 


A Vector Load operates as follows: 
zmm0 <= U(m)


where U(m) is an up-converted memory operand whose contents are replicated and written to destination
register zmm0. The mnemonic dictates the degree of replication and the conversion table. 


A special sub-case of these instructions are Vector Gathers. Vector Gathers are a special form of vector 
loads where, instead of a consecutive chunk of memory, we load a sparse set of memory operands (as
many as the vector elements of the destination). Every one of those memory operands must obey the 
alignment rules; otherwise, a #GP fault will result if the related write-mask bit is not disabled (set to 0). 


A Vector Gather operates as follows: 


zmm0 <= U(mv)


where U(mv) is a set of up-converted memory operands described by a base address, a vector of indices
and an immediate scale to apply for each index. Every one of those operands is conditionally written to 
destination vector zmm0 (based on the value of the write-mask). 
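

The addressing and masking rules of a gather can be sketched in Python. This is a simplified model under stated assumptions: memory is a flat `bytes` object, elements are 32-bit values with no up-conversion, and a `ValueError` stands in for the #GP fault. The function name `gather_dwords` is hypothetical, and any gather behavior not described in this section is not modeled.

```python
def gather_dwords(dst, memory, base, indices, scale, mask):
    """Model of a masked 16-element gather of 32-bit values."""
    out = list(dst)
    for lane, index in enumerate(indices):
        if not (mask >> lane) & 1:
            continue  # masked-off lane: no access, no fault, destination kept
        address = base + index * scale
        if address % 4 != 0:
            raise ValueError(f"misaligned element at {address:#x} (models #GP)")
        out[lane] = int.from_bytes(memory[address:address + 4], "little")
    return out


# 32 dwords holding the values 100, 101, ..., 131
memory = b"".join((100 + i).to_bytes(4, "little") for i in range(32))
indices = [2 * i for i in range(16)]  # every other dword
gathered = gather_dwords([0] * 16, memory, 0, indices, 4, 0x00FF)
```

With mask 0x00FF only the lower eight lanes are loaded; the upper eight lanes keep their previous contents, and a misaligned address in a masked-off lane does not raise an error in this model.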


Vector Stores. A vector store reads a vector register source, performs a predefined store conversion function, 
and writes the result to the destination memory location on a per-element basis, with the writes enabled 
or disabled according to the corresponding bits of the vector mask register selected as the write-mask. 


The size of the memory destination is a function of the type of conversion associated with the mnemonic. 
We call this special memory operand a down-converted memory operand. Each memory destination 
operand must have an address that is aligned to the number of bytes of memory accessed by the operand 
(pre-conversion, if conversion is performed); otherwise, a #GP fault will result. 


A Vector Store operates as follows: 
m <= D(zmm0)


where zmm0 is the vector register source whose full contents are down-converted (denoted by D()), and 
written to memory. 


A special sub-case of these instructions are Vector Scatters. Vector Scatters are a special form of vector 
stores where, instead of writing the source vector into a consecutive chunk of memory, we store each
vector element into a different memory location. Every one of those memory destinations must obey the 
alignment rules; otherwise, a #GP fault will result if the related write-mask bit is not disabled (set to 0). 


A Vector Scatter operates as follows: 
mv <= D(zmm0)


where zmm0 is the vector register source whose full or partial contents are down-converted (denoted
by D()), and written to the set of memory locations mv, specified by a base address, a vector of indices
and an immediate scale which is applied to every index. Every one of those down-converted elements is
conditionally stored in the memory locations based on the value of the write-mask.
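

A scatter can be modeled in the same spirit. The sketch below assumes a flat `bytearray` as memory, 32-bit elements with no down-conversion, and a `ValueError` in place of the #GP fault; the function name `scatter_dwords` is hypothetical.

```python
def scatter_dwords(memory, src, base, indices, scale, mask):
    """Model of a masked 16-element scatter of 32-bit values (in place)."""
    for lane, index in enumerate(indices):
        if not (mask >> lane) & 1:
            continue  # masked-off lane: nothing is written, nothing can fault
        address = base + index * scale
        if address % 4 != 0:
            raise ValueError(f"misaligned element at {address:#x} (models #GP)")
        memory[address:address + 4] = (src[lane] & 0xFFFFFFFF).to_bytes(4, "little")


memory = bytearray(64)
source = list(range(1, 17))               # lane i holds the value i + 1
indices = [15 - i for i in range(16)]     # reverse the element order
scatter_dwords(memory, source, 0, indices, 4, 0x0003)
```

With mask 0x0003 only lanes 0 and 1 are stored, to byte offsets 60 and 56 respectively; the rest of the buffer is untouched.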


2.1.3.3. Knights Corner instructions vector mask Instructions 


Vector mask instructions allow programmers to set, copy, or operate on the contents of a given vector mask. 
There are three types of vector mask instructions: 


• Mask read/write instructions: These instructions move data between a general-purpose integer register
and a vector mask register, or between two vector mask registers. 


• Flag instructions: This category, consisting of instructions that modify EFLAGS based on vector mask
registers, actually contains only one instruction, kortest. 


• Mask logical instructions: These instructions perform standard bitwise logical operations between
vector mask registers.
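

The mask logical and flag categories can be illustrated with a small Python model of 16-bit masks. This is a sketch only: the helper names are hypothetical, and `or_status` reports the 0/0xFFFF/other classification described earlier in this chapter rather than specific EFLAGS bits, since the exact flag mapping is not given in this section.

```python
MASK_WIDTH = 0xFFFF  # a vector mask register is 16 bits wide


def mask_and(k1: int, k2: int) -> int:
    """Bitwise AND of two vector masks."""
    return k1 & k2 & MASK_WIDTH


def mask_or(k1: int, k2: int) -> int:
    """Bitwise OR of two vector masks."""
    return (k1 | k2) & MASK_WIDTH


def or_status(k1: int, k2: int) -> str:
    """Classify the OR of two masks as 'zero', 'all_ones' or 'other'."""
    combined = mask_or(k1, k2)
    if combined == 0:
        return "zero"
    if combined == MASK_WIDTH:
        return "all_ones"
    return "other"


active = 0x8F03
done = mask_and(active, 0x0000)
loop_finished = or_status(done, done) == "zero"
```

A typical use is loop control in vectorized code with conditionals: when the OR of the remaining-lanes mask with itself is classified as zero, no element is still active and the loop can exit.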


2.1.3.4 Knights Corner instructions New Scalar Instructions 


In addition to vector, vector memory, and vector mask instructions, Knights Corner instructions adds a few scalar 
instructions as well. These instructions are useful for increasing the performance of some critical algorithms; 
for example, any code that suffers reduced performance due to cache-miss latency can benefit from the new
prefetch instructions. 


2.2 Knights Corner instructions Swizzles and Converts 


Data transformation, in the form of certain data conversions or element rearrangements (for loads, both at once) 
of one operand, can be performed for free as part of most Knights Corner instructions vector instructions. 


Three sorts of data transformations are available: 


• Data Conversions: Sources from memory can be converted to either 32 bit signed or unsigned integer or
32 bit floating-point before being used. Supported data types in memory are float16, sint8, uint8, sint16,
and uint16 for load-op instructions.


• Broadcast: If the source memory operand contains fewer than the total number of elements, it can be
broadcast (repeated) to form the full number of elements of the effective source operand (16 for 32 bit
instructions, 8 for 64 bit instructions). Broadcast can be combined with load-type conversions only;
load-op instructions can do one or the other: either broadcast, or swizzle and/or up-conversion. There are two
broadcast granularities:


- 1-element granularity, where the 1 element of the source memory operand is broadcast 16 times
to form a full 16-element effective source operand (for 32 bit instructions), or 8 times to form a full
8-element effective source operand (for 64 bit instructions).


- 4-element granularity, where the 4 elements of the source memory operand are broadcast 4 times
to form a full 16-element effective source operand (for 32 bit instructions), or 2 times to form a full
8-element effective source operand (for 64 bit instructions).


Broadcast is very useful for instructions that mix vector and scalar sources, where one of the sources is
common across the different operations.


• Swizzles: Sources from registers can undergo swizzle transformations (that is, they can be permuted),
although only 8 swizzles are available, all of which are limited to permuting within 4-element sets (either
of 32 bits or 64 bits each).
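

The two broadcast granularities can be sketched in Python as list replication. This models only the element replication described above; vectors are lists with element 0 as the least significant element, and the function name `broadcast` is hypothetical.

```python
def broadcast(elements, total):
    """Replicate 1 or 4 source elements to form a vector of `total` elements.

    total is 16 for 32 bit instructions and 8 for 64 bit instructions.
    """
    if len(elements) not in (1, 4):
        raise ValueError("only 1-element and 4-element broadcasts exist")
    if total not in (8, 16):
        raise ValueError("a vector holds 8 or 16 elements")
    return list(elements) * (total // len(elements))


one_to_16 = broadcast([7], 16)             # 1 element repeated 16 times
four_to_16 = broadcast([0, 1, 2, 3], 16)   # 4 elements repeated 4 times
four_to_8 = broadcast([0, 1, 2, 3], 8)     # 4 elements repeated 2 times
```

This is the pattern that makes broadcasts convenient for mixing a scalar (or a small group of values) with a full vector in one operation.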


Knights Corner instructions also introduces the concept of Rounding Mode Override or Static (per instruc- 
tion) Rounding Mode, which efficiently supports the feature of determining the rounding mode for arithmetic 
operations on a per-instruction basis. Thus one can choose the rounding mode without having to perform costly 
MXCSR save-modify-restore operations. 


Knights Corner extends the swizzle functionality for register-register operands in order to provide rounding 
mode override capabilities for Knights Corner floating-point instructions instead of obeying the MXCSR.RC bits. 
All four rounding modes are available via swizzle attribute: Round-up, Round-down, Round-toward-zero and 
Round-to-nearest. The option is not available for instructions with memory operands. On top of these options, 
Knights Corner introduces the SAE (suppress-all-exceptions) attribute feature. An instruction with SAE set will 
not raise any kind of floating-point exception flags, independent of the inputs. 


In addition to those transformations, all Knights Corner instructions memory operands may have a special
attribute, called the EH hint (eviction hint), that indicates to the processor that the data is non-temporal; that is,
it is unlikely to be reused soon enough to benefit from caching in the 1st-level cache and should be given
priority for eviction. This is, however, a hint, and the processor may implement it in any way it chooses, including
ignoring the hint entirely.


Table 2.1 shows the assembly language syntax used to indicate the presence or absence of the EH hint.


EH | Function | Usage     | Comment
0  |          | [eax]     | (no effect) regular memory operand
1  | EH       | [eax]{eh} | memory operand with Non-Temporal (Eviction) hint


Table 2.1: EH attribute syntax.


Data transformations can only be performed on one source operand at most; for instructions that take two or 
three source operands, the other operands are always used unmodified, exactly as they're stored in their source 
registers. In no case do any of the Knights Corner instructions allow using data conversion and swizzling at 
the same time. Broadcasts, on the other hand, can be combined with data conversions when performing vector 
loads. 


Not all instructions can use all of the different data transformations. Load-op instructions (such as vector arith- 
metic instructions), vector loads, and vector stores have different data transformation capabilities. We can cat- 
egorize these transformation capabilities into three families: 


• Load-Op SwizzUpConv: For a register source, swizzle; for a memory operand, either: (a) broadcast, or (b)
convert to 32 bit floats or 32 bit signed or unsigned integers. This is used by vector arithmetic instructions
and other load-op instructions. There are two versions, one for 32 bit floating-point instructions and
another for 32 bit integer instructions; in addition, the available data transformations differ for register
and memory operands.


• Load UpConv: Convert from a memory operand to 32 bit floats or 32 bit signed or unsigned integers; used
by vector loads and broadcast instructions. For 32 bit floats, there are three different conversion tables
based on three different input types. See Section 2.2.2, Load Up-convert.
There is no load conversion support for 64 bit datatypes.


• DownConv: Convert from 32 bit floats or 32 bit signed or unsigned integers to a memory operand; used by
vector stores. For 32 bit floats, there are three different conversion tables based on three different output
types. See Section 2.2.3, Down-Conversion.
There is no store conversion support for 64 bit datatypes.


2.2.1 Load-Op Swizzle/Convert


Vector load-op instructions can swizzle, broadcast, or convert one of the sources; we will refer to this as the 
swizzle/convert source, and we will use SwizzUpConv to describe the swizzle/convert function itself. The avail- 
able SwizzUpConv transformations vary depending on whether the operand is memory or a register, and also 
in the case of conversions from memory depending on whether the vector instruction is 32 bit integer, 32 bit 
floating-point, 64 bit integer or 64 bit floating-point. 3 bits are used to select among the different options, so 
eight options are available in each case. 


When the swizzle/convert source is a register, SwizzUpConv allows the choice of one of eight swizzle primitives 
(one of the eight being the identity swizzle). These swizzle functions work on either 4-byte or 8-byte elements 
within 16-byte/32-byte boundaries. For 32 bit instructions, that means certain permutations of each set of four 
elements (16 bytes) are supported, replicated across the four sets of four elements. When the swizzle/convert 
source is a register, the functionality is the same for both integer and floating-point 32 bit instructions. Table 2.2 
shows the available register-source swizzle primitives. 


S2S1S0 | Function: 4 x 32 bits                        | Usage
000    | no swizzle                                   | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs                           | zmm0 {cdab}
010    | swap with two-away                           | zmm0 {badc}
011    | cross-product swizzle                        | zmm0 {dacb}
100    | broadcast a element across 4-element packets | zmm0 {aaaa}
101    | broadcast b element across 4-element packets | zmm0 {bbbb}
110    | broadcast c element across 4-element packets | zmm0 {cccc}
111    | broadcast d element across 4-element packets | zmm0 {dddd}


Table 2.2: 32 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 32 bit elements that
form one 128-bit block in the source (with 'a' least significant and 'd' most significant), so aaaa means that the
least significant element of the 128-bit block in the source is replicated to all four elements of the same
128-bit block in the destination; the depicted pattern is then repeated for all four 128-bit blocks in the source and
destination. We use 'ponm lkji hgfe dcba' to denote a full Knights Corner instructions source register, where 'a' is
the least significant element and 'p' is the most significant element. However, since each 128-bit block performs
the same permutation for register swizzles, we only show the least significant block here. Note that in this table
as well as in subsequent ones from this chapter S2S1S0 are bits 6-4 from MVEX prefix encoding (see Figure 3.3).


S2S1S0 | Function: 4 x 64 bits                        | Usage
000    | no swizzle                                   | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs                           | zmm0 {cdab}
010    | swap with two-away                           | zmm0 {badc}
011    | cross-product swizzle                        | zmm0 {dacb}
100    | broadcast a element across 4-element packets | zmm0 {aaaa}
101    | broadcast b element across 4-element packets | zmm0 {bbbb}
110    | broadcast c element across 4-element packets | zmm0 {cccc}
111    | broadcast d element across 4-element packets | zmm0 {dddd}


Table 2.3: 64 bit Register SwizzUpConv swizzle primitives. Notation: dcba denotes the 64 bit elements that
form one 256-bit block in the source (with 'a' least significant and 'd' most significant), so aaaa means that the
least significant element of the 256-bit block in the source is replicated to all four elements of the same
256-bit block in the destination; the depicted pattern is then repeated for the two 256-bit blocks in the source and
destination. We use 'hgfe dcba' to denote a full Knights Corner instructions source register, where 'a' is the least
significant element and 'h' is the most significant element. However, since each 256-bit block performs the same
permutation for register swizzles, we only show the least significant block here.


For 64 bit instructions, that means certain permutations of each set of four elements (32 bytes) are supported, 
replicated across the two sets of four elements. When the swizzle/convert source is a register, the functionality 
is the same for both integer and floating-point 64 bit instructions. Table 2.3 shows the available register-source 
swizzle primitives. 
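

The register swizzle notation of Tables 2.2 and 2.3 can be interpreted mechanically. The Python sketch below is a model derived from the notation in the table captions, not a description of the hardware: the pattern string is written most significant element first, vectors are lists with element 0 least significant, and the function name `swizzle` is hypothetical.

```python
def swizzle(src, pattern):
    """Apply a 4-element swizzle pattern such as 'cdab' to every 4-element block.

    src holds 16 elements (32 bit case) or 8 elements (64 bit case).
    The letter at each position names the source element ('a' = element 0
    of the block) that is copied to that position of the destination block.
    """
    if len(pattern) != 4 or any(c not in "abcd" for c in pattern):
        raise ValueError("pattern must be four letters from 'abcd'")
    # Reverse so that picks[i] is the source index for destination element i.
    picks = [ord(c) - ord("a") for c in reversed(pattern)]
    out = []
    for start in range(0, len(src), 4):
        out.extend(src[start + p] for p in picks)
    return out


src = list(range(16))
swapped_pairs = swizzle(src, "cdab")   # swap (inner) pairs
two_away = swizzle(src, "badc")        # swap with two-away
cross = swizzle(src, "dacb")           # cross-product swizzle
bcast_a = swizzle(src, "aaaa")         # broadcast element a in each block
```

Because the same permutation is applied to every block, the model works unchanged for the two 4-element blocks of a 64 bit vector.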


When the source is a memory location, load-op swizzle/convert can perform either no transformation, 2 differ- 
ent broadcasts, or four data conversions. Vector load-op instructions cannot both broadcast and perform data 
conversion at the same time. The conversions available differ depending on whether the associated vector in- 
struction is integer or floating-point, and whether the natural data type is 32 bit or 64 bit. (Note however that 
there are no load conversions for 64 bit destination data types.) 


Source memory operands may have sizes smaller than 64 bytes, expanding to the full 64 bytes of a vector source 
by means of either broadcasting (replication) or data conversion. 


Each source memory operand must have an address that is aligned to the number of bytes of memory actually
accessed by the operand (that is, before conversion or broadcast is performed); otherwise, a #GP fault will result.
Thus, for SwizzUpConv, any of 4-byte, 16-byte, 32-byte, or 64-byte alignment may be required.
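

The alignment requirement follows from simple arithmetic: the bytes accessed equal the number of elements read from memory times the size of each element in memory. The Python sketch below works through that arithmetic for 32 bit load-op operands; the table of in-memory element sizes and the function names are assumptions made for illustration, and a `ValueError` stands in for the #GP fault.

```python
# Size in bytes of one element as stored in memory, before up-conversion.
ELEMENT_BYTES = {
    "float32": 4, "int32": 4,
    "float16": 2, "uint16": 2, "sint16": 2,
    "uint8": 1, "sint8": 1,
}


def bytes_accessed(mem_type: str, elements_read: int) -> int:
    """Bytes actually read from memory for a load-op source operand."""
    return ELEMENT_BYTES[mem_type] * elements_read


def check_alignment(address: int, mem_type: str, elements_read: int) -> int:
    """Return the required alignment, or raise if the address violates it."""
    required = bytes_accessed(mem_type, elements_read)
    if address % required != 0:
        raise ValueError(f"{address:#x} not {required}-byte aligned (models #GP)")
    return required


full_vector = bytes_accessed("float32", 16)   # no conversion: 64 bytes
one_to_16 = bytes_accessed("float32", 1)      # {1to16}: 4 bytes
four_to_16 = bytes_accessed("float32", 4)     # {4to16}: 16 bytes
from_float16 = bytes_accessed("float16", 16)  # {float16}: 32 bytes
from_uint8 = bytes_accessed("uint8", 16)      # {uint8}: 16 bytes
```

These cases reproduce the 4-byte, 16-byte, 32-byte, and 64-byte requirements listed above.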


S2S1S0 | Function:                 | Usage
000    | no conversion             | [rax]
001    | broadcast 1 element (x16) | [rax] {1to16}
010    | broadcast 4 elements (x4) | [rax] {4to16}
011    | float16 to float32        | [rax] {float16}
100    | uint8 to float32          | [rax] {uint8}
101    | reserved                  | N/A
110    | uint16 to float32         | [rax] {uint16}
111    | sint16 to float32         | [rax] {sint16}


Table 2.4: 32 bit Floating-point Load-op SwizzUpConv_f32 swizzle/conversion primitives. We use 'ponm lkji
hgfe dcba' to denote a full Knights Corner instructions source register, with each letter referring to a 32 bit
element, where 'a' is the least significant element and 'p' is the most significant element. So, for example, 'dcba
dcba dcba dcba' shows that the source elements are copied to the destination by replicating the lower 128 bits
of the source (the four least significant elements) to each 128-bit block of the destination.


S2S1S0 | Function:                 | Usage
000    | no conversion             | [rax] {16to16} or [rax]
001    | broadcast 1 element (x16) | [rax] {1to16}
010    | broadcast 4 elements (x4) | [rax] {4to16}
011    | reserved                  | N/A
100    | uint8 to uint32           | [rax] {uint8}
101    | sint8 to sint32           | [rax] {sint8}
110    | uint16 to uint32          | [rax] {uint16}
111    | sint16 to sint32          | [rax] {sint16}


Table 2.5: 32 bit Integer Load-op SwizzUpConv_i32 (Doubleword) swizzle/conversion primitives. We use
'ponm lkji hgfe dcba' to denote a full Knights Corner instructions source register, with each letter referring to a
32 bit element, where 'a' is the least significant element and 'p' is the most significant element. So, for example,
'dcba dcba dcba dcba' shows that the source elements are copied to the destination by replicating the lower 128
bits of the source (the four least significant elements) to each 128-bit block of the destination.


Table 2.4 shows the available 32 bit floating-point swizzle primitives. 
SwizzUpConv conversions to float32s are exact. 

Table 2.5 shows the available 32 bit integer swizzle primitives. 

Table 2.6 shows the available 64 bit floating-point swizzle primitives. 


Finally, Table 2.7 shows the available 64 bit integer swizzle primitives. 


S2S1S0 | Function:                 | Usage
000    | no conversion             | [rax] {8to8} or [rax]
001    | broadcast 1 element (x8)  | [rax] {1to8}
010    | broadcast 4 elements (x2) | [rax] {4to8}
011    | reserved                  | N/A
100    | reserved                  | N/A
101    | reserved                  | N/A
110    | reserved                  | N/A
111    | reserved                  | N/A


Table 2.6: 64 bit Floating-point Load-op SwizzUpConv_f64 swizzle/conversion primitives. We use 'hgfe dcba'
to denote a full Knights Corner instructions source register, with each letter referring to a 64 bit element, where
'a' is the least significant element and 'h' is the most significant element. So, for example, 'dcba dcba' shows that
the source elements are copied to the destination by replicating the lower 256 bits of the source (the four least
significant elements) to each 256-bit block of the destination.


S2S1S0 | Function:                 | Usage
000    | no conversion             | [rax] {8to8} or [rax]
001    | broadcast 1 element (x8)  | [rax] {1to8}
010    | broadcast 4 elements (x2) | [rax] {4to8}
011    | reserved                  | N/A
100    | reserved                  | N/A
101    | reserved                  | N/A
110    | reserved                  | N/A
111    | reserved                  | N/A


Table 2.7: 64 bit Integer Load-op SwizzUpConv_i64 (Quadword) swizzle/conversion primitives. We use 'hgfe
dcba' to denote a full Knights Corner instructions source register, with each letter referring to a 64 bit element,
where 'a' is the least significant element and 'h' is the most significant element. So, for example, 'dcba dcba'
shows that the source elements are copied to the destination by replicating the lower 256 bits of the source (the
four least significant elements) to each 256-bit block of the destination.


2.2.2 Load Up-convert


Vector load/broadcast instructions can perform a wide array of data conversions on the data being read from
memory, and can additionally broadcast (replicate) that data across the elements of the destination vector
register, depending on the instruction. The type of broadcast depends on the opcode/mnemonic being used. We
will refer to this conversion process as up-conversion, and we will use UpConv to describe the load conversion
function itself.


Based on that, load instructions could be divided into the following categories: 


¢ regular loads: load 16 elements (32 bits) or 8 elements (64 bits), convert them and write into the destina- 
tion vector 


¢ broadcast 4-elements: load 4 elements, convert them (possible only for 32 bit data types), replicate them 
four times (32 bits) or two times (64 bits) and write into the destination vector 


¢ broadcast 1-element: load 1 element, convert it (possible only for 32 bit data types), replicate it 16 times 
(32 bits) or 8 times (64 bits) and write into the destination vector 


Therefore, unlike load-op swizzle/conversion, Load UpConv can perform both data conversion and broadcast 
simultaneously. We will refer to this process as up-conversion, and we will use Load UpConv to describe the load 
conversion function itself. 


When a broadcast 1-element is selected, the memory data, after data conversion, has a size of 4 bytes, and is
broadcast 16 times across all 16 elements of the destination vector register. In other words, one vector element
is fetched from memory, converted to a 32 bit float or integer, and replicated to all 16 elements of the destination
register. Using the notation where the contents of the source register are denoted {ponm lkji hgfe dcba}, with
each letter referring to a 32 bit element ('a' being the least significant element and 'p' being the most significant
element), the source elements map to the destination register as follows: 


{aaaa aaaa aaaa aaaa} 


When broadcast 4-element is selected, the memory data, after data conversion, has a size of 16 bytes, and is 
broadcast 4 times across the four 128-bit sets of the destination vector register. In other words, four vector 
elements are fetched from memory, converted to four 32 bit floats or integers, and replicated to all four 4-element 
sets in the destination register. For this broadcast, the source elements map to the destination register as follows: 


{dcba dcba dcba dcba} 
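

The two 32 bit broadcast mappings above can be modelled with a short Python sketch. This is only an illustration of the element mapping, not of the hardware: the function names are hypothetical, data conversion is assumed to have already happened, and list index 0 stands for the least significant element 'a'.

```python
def broadcast_1to16(mem):
    """Replicate one (already converted) 32 bit element to all 16 elements."""
    return [mem[0]] * 16


def broadcast_4to16(mem):
    """Replicate four (already converted) 32 bit elements to each 128-bit block."""
    return list(mem[:4]) * 4


def show(elems):
    """Format most significant element first, in groups of four, as in the text."""
    s = "".join(reversed(elems))
    return "{" + " ".join(s[i:i + 4] for i in range(0, len(s), 4)) + "}"


# Index 0 is the least significant element ('a').
dst1 = broadcast_1to16(["a"])
dst4 = broadcast_4to16(["a", "b", "c", "d"])
print(show(dst1))
print(show(dst4))
```

Printing the lists most-significant-first reproduces the {aaaa aaaa aaaa aaaa} and {dcba dcba dcba dcba} patterns used in this section.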


Table 2.8 shows the 32 bit Load up-conversion instructions as a function of the broadcast function and
the conversion datatype. Similarly, Table 2.10 shows the 64 bit Load up-conversion instructions as a
function of the broadcast function and datatype.


Datatype     Load (16-element)    Broadcast 4-element    Broadcast 1-element
INT32 (d)    VMOVDQA32            VBROADCASTI32X4        VPBROADCASTD
FP32 (ps)    VMOVAPS              VBROADCASTF32X4        VBROADCASTSS


Table 2.8: 32 bit Load UpConv load/broadcast instructions per datatype. Elements may be 1, 2, or 4 bytes in 
memory prior to data conversion, after which they are always 4 bytes. We use 'ponm lkji hgfe dcba' to denote 
a full Knights Corner instructions source register, with each letter referring to a 32 bit element, where ‘a’ is the 
least significant element and 'p' is the most significant element. So, for example, 'dcba dcba dcba dcba' shows 
that the source elements are copied to the destination by replicating the lower 128 bits of the source (the four 
least significant elements) to each 128-bit block of the destination. 


As with SwizzUpConv, UpConv may have source memory operands with sizes smaller than 64-bytes, which are 
expanded to a full 64-byte vector by means of broadcast and/or data conversion. Each source memory operand 
must have an address that is aligned to the number of bytes of memory actually accessed by the operand (that 
is, before conversion or broadcast is performed); otherwise, a #GP fault will result. Thus, any of 1-byte, 2-byte, 
4-byte, 8-byte, 16-byte, 32-byte, or 64-byte alignment may be required. 
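

As a rough illustration of the alignment rule for 32 bit loads, the number of bytes actually accessed can be computed as the number of elements read from memory times the size of one element in memory. The sketch below assumes the element sizes implied by the conversion names in Table 2.9; the dictionary keys and function names are hypothetical and chosen only for this example.

```python
# Bytes per element in memory, before conversion (assumed from the
# conversion names in Table 2.9; "none" means no conversion).
MEM_ELEM_BYTES = {
    "none": 4, "float16": 2, "uint8": 1, "sint8": 1, "uint16": 2, "sint16": 2,
}

# Number of elements actually read from memory for each load category.
ELEMS_LOADED = {"16to16": 16, "4to16": 4, "1to16": 1}


def bytes_accessed(conv, bcast):
    """Bytes read from memory, before conversion or broadcast."""
    return MEM_ELEM_BYTES[conv] * ELEMS_LOADED[bcast]


def is_aligned(address, conv, bcast):
    """True if the address meets the alignment required for this access."""
    return address % bytes_accessed(conv, bcast) == 0


print(bytes_accessed("float16", "16to16"))      # 16 elements of 2 bytes
print(is_aligned(0x1008, "float16", "4to16"))   # 8-byte access
print(is_aligned(0x1008, "none", "16to16"))     # 64-byte access
```

Under these assumptions an address such as 0x1008 is acceptable for an 8-byte access but would raise #GP for a full 64-byte load.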


Table 2.9 shows the available data conversion primitives for 32 bit Load UpConv and for the different datatypes 
supported. 


Table 2.11 shows the 64 bit counterpart of Load UpConv. As shown, the only option available for 64 bit data
is the pure "no-conversion" option.


2.2.3 Down-Conversion 


Vector store instructions can perform a wide variety of data conversions to the data on the way to memory. 
We will refer to this process as down-conversion, and we will use DownConv to describe the store conversion 
function itself. 


DownConv may have destination memory operands with sizes smaller than 64 bytes, as a result of data conversion.
Each destination memory operand must have an address that is aligned to the number of bytes of memory
actually accessed by the operand (that is, after data conversion is performed); otherwise, a #GP fault will result.


UpConv_i32 (INT32)

S2S1S0    Function               Usage

000       no conversion          [rax]
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       uint8 to uint32        [rax] {uint8}
101       sint8 to sint32        [rax] {sint8}
110       uint16 to uint32       [rax] {uint16}
111       sint16 to sint32       [rax] {sint16}


UpConv_f32 (FP32)

S2S1S0    Function               Usage

000       no conversion          [rax]
001       reserved               N/A
010       reserved               N/A
011       float16 to float32     [rax] {float16}
100       uint8 to float32       [rax] {uint8}
101       sint8 to float32       [rax] {sint8}
110       uint16 to float32      [rax] {uint16}
111       sint16 to float32      [rax] {sint16}

Table 2.9: 32 bit Load UpConv conversion primitives. 

Datatype     Load         Broadcast 4-element    Broadcast 1-element
INT64 (q)    VMOVDQA64    VBROADCASTI64X4        VPBROADCASTQ
FP64 (pd)    VMOVAPD      VBROADCASTF64X4        VBROADCASTSD


Table 2.10: 64 bit Load UpConv load/broadcast instructions per datatype. Elements are always 8 bytes. We
use 'hgfe dcba' to denote a full Knights Corner instructions source register, with each letter referring to a 64
bit element, where 'a' is the least significant element and 'h' is the most significant element. So, for example,
'dcba dcba' shows that the source elements are copied to the destination by replicating the lower 256 bits of the
source (the four least significant elements) to each 256-bit block of the destination.


Thus, any of 1-byte, 2-byte, 4-byte, 8-byte, 16-byte, 32-byte, or 64-byte alignment may be required. 


Table 2.12 shows the available data conversion primitives for 32 bit DownConv and for the different supported
datatypes.


Table 2.13 shows the 64 bit counterpart of DownConv. As shown, the only option available for 64 bit data
is the pure "no-conversion" option.


UpConv_i64 (INT64)

S2S1S0    Function               Usage

000       no conversion          [rax] {8to8} or [rax]
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       reserved               N/A
101       reserved               N/A
110       reserved               N/A
111       reserved               N/A


UpConv_f64 (FP64)

S2S1S0    Function               Usage

000       no conversion          [rax] {8to8} or [rax]
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       reserved               N/A
101       reserved               N/A
110       reserved               N/A
111       reserved               N/A


Table 2.11: 64 bit Load UpConv conversion primitives. 


DownConv_i32 (INT32)

S2S1S0    Function               Usage

000       no conversion          zmm1
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       uint32 to uint8        zmm1 {uint8}
101       sint32 to sint8        zmm1 {sint8}
110       uint32 to uint16       zmm1 {uint16}
111       sint32 to sint16       zmm1 {sint16}


DownConv_f32 (FP32)

S2S1S0    Function               Usage

000       no conversion          zmm1
001       reserved               N/A
010       reserved               N/A
011       float32 to float16     zmm1 {float16}
100       float32 to uint8       zmm1 {uint8}
101       float32 to sint8       zmm1 {sint8}
110       float32 to uint16      zmm1 {uint16}
111       float32 to sint16      zmm1 {sint16}


Table 2.12: 32 bit DownConv conversion primitives. Unless otherwise noted, all conversions from
floating-point use MXCSR.RC.


DownConv_i64 (INT64)

S2S1S0    Function               Usage

000       no conversion          zmm1
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       reserved               N/A
101       reserved               N/A
110       reserved               N/A
111       reserved               N/A


DownConv_f64 (FP64)

S2S1S0    Function               Usage

000       no conversion          zmm1
001       reserved               N/A
010       reserved               N/A
011       reserved               N/A
100       reserved               N/A
101       reserved               N/A
110       reserved               N/A
111       reserved               N/A


Table 2.13: 64 bit DownConv conversion primitives. 


2.3 Static Rounding Mode 


As described before, Knights Corner introduces a new instruction attribute on top of the normal register swizzles
called Static (per instruction) Rounding Mode or Rounding Mode override. This attribute statically applies
a specific arithmetic rounding mode to an instruction, ignoring the value of the RC bits in MXCSR.


Static Rounding Mode can be enabled in the encoding of the instruction by setting the EH bit to 1 in a
register-register vector instruction. Table 2.14 shows the available rounding modes and their encoding. On top of
the rounding mode, Knights Corner also allows setting the SAE ("suppress-all-exceptions") attribute, which disables
reporting of any floating-point exception flag in MXCSR. This option is available even if the instruction does not
perform any kind of rounding.


Note that some instructions already allow the rounding mode to be specified statically via immediate bits. In such
cases, the immediate bits take precedence over the swizzle-specified rounding mode (in the same way that they
take precedence over the MXCSR.RC setting).


S2S1S0    Rounding Mode Override               Usage

000       Round To Nearest (even)              , {rn}
001       Round Down (-INF)                    , {rd}
010       Round Up (+INF)                      , {ru}
011       Round Toward Zero                    , {rz}
100       Round To Nearest (even) with SAE     , {rn-sae}
101       Round Down (-INF) with SAE           , {rd-sae}
110       Round Up (+INF) with SAE             , {ru-sae}
111       Round Toward Zero with SAE           , {rz-sae}
1xx       SAE                                  , {sae}


Table 2.14: Static Rounding-Mode Swizzle available modes plus SAE. 
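

The encoding in Table 2.14 can be read as two independent sub-fields: the two low bits select the rounding mode and the high bit selects SAE. The following Python sketch decodes a 3-bit value on that reading; the function names are hypothetical and the sketch covers only the table, not the instruction encoding around it.

```python
ROUNDING_MODES = ("rn", "rd", "ru", "rz")  # indexed by the two low bits


def decode_static_rounding(sss):
    """Split a 3-bit S2S1S0 value into (rounding mode, SAE flag)."""
    if not 0 <= sss <= 0b111:
        raise ValueError("S2S1S0 is a 3-bit field")
    mode = ROUNDING_MODES[sss & 0b011]
    sae = bool(sss & 0b100)
    return mode, sae


def usage_modifier(sss):
    """Return the assembly modifier listed in the Usage column."""
    mode, sae = decode_static_rounding(sss)
    return "{" + mode + ("-sae" if sae else "") + "}"


for value in range(8):
    print(format(value, "03b"), usage_modifier(value))
```

For instructions that perform no rounding, only the high bit is meaningful, which corresponds to the 1xx row and the {sae} modifier.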
2.4 Knights Corner Execution Environments 


Knights Corner's support for 32 bit and 64 bit execution environments is similar to that described in the Intel64®
Intel® Architecture Software Developer's Manual. The 64 bit execution environment of Knights Corner is shown
in Figure 2.1. The layout of the 512-bit vector registers and vector mask registers is shown in Figure 2.2. This
section describes new features associated with the 512-bit vector registers and the 16 bit vector mask registers.


Knights Corner instructions define two new sets of registers that hold the new vector state. The Knights Corner
instructions extension uses the vector registers, the vector mask registers and/or the x86-64 general purpose
registers.


Knights Corner instructions Vector Registers. The 32 registers each store 16 doubleword/single precision
floating-point entries (or 8 quadword/double precision floating-point entries), and serve as source
and destination operands for vector packed floating point and integer operations. Additionally, they may
also contain memory pointer offsets used to gather and scatter data from/to memory. These registers are
referenced as zmm0 through zmm31.


Vector Mask Registers. These registers specify which vector elements are operated on and written for Knights
Corner vector instructions. If the Nth bit of a vector mask register is set, then the Nth element
of the destination vector is overwritten with the result of the operation; otherwise, the element remains
unchanged. A vector mask register can be set using vector compare instructions, instructions to move
contents from a GP register, or a special subset of vector mask arithmetic instructions.


Knights Corner vector instructions are able to report exceptions via MXCSR flags but never cause traps,
as all SIMD floating-point exceptions are always masked (unlike Intel® SSE/Intel® AVX instructions in
other processors, which may trap if floating-point exceptions are unmasked, depending on the value of the
OM/UM/IM/PM/DM/ZM bits). The reason is that Knights Corner forces the new DUE bit (Disable
Unmasked Exceptions) in the MXCSR (bit 21) to be set to 1.


On Knights Corner, both single precision and double precision floating-point instructions use MXCSR.DAZ 
and MXCSR.FZ to decide whether to treat input denormals as zeros or to flush tiny results to zero (the 
latter are in most cases - but not always - denormal results which are flushed to zero when MXCSR.FZ is 
set to 1; see the IEEE Standard 754-2008, section 7.5, for a definition of tiny floating-point results). 


Table 2.15 shows the bit layout of the MXCSR control register. 


MXCSR bit 20 is reserved; however, it is not reported as Reserved by MXCSR_MASK. Setting this bit will
result in undefined behavior.


[Figure 2.1: 64 bit Execution Environment. The figure shows the address space (0 to 2^64-1) and the register
sets: the basic program execution registers (sixteen 64 bit general-purpose registers, the segment registers, and
RIP, the instruction pointer register); the FPU registers (eight 80 bit floating-point data registers, the control,
status and tag registers, the 11 bit opcode register, the FPU instruction pointer register, and the FPU data
(operand) pointer register); the thirty-two 512 bit vector registers; the eight 16 bit vector mask registers; and
the MXCSR register.]


[Figure 2.2: Vector and Vector Mask Registers]


General-purpose registers. The sixteen general-purpose registers are available in Knights Corner's 64 bit
mode execution environment. These registers are identical to those available in the 64 bit execution
environment described in the Intel64® Intel® Architecture Software Developer's Manual.


EFLAGS register. R/EFLAGS is updated by instructions according to the Intel64® Intel® Architecture
Software Developer's Manual. Additionally, it is updated by Knights Corner's KORTEST instruction.


FCW and FSW registers. Used by x87 instruction set extensions to set rounding modes, exception masks and 
flags in the case of the FCW, and to keep track of exceptions in the case of the FSW. 


x87 stack. An eight-element stack used to perform floating-point operations on 32/64/80-bit floating-point 
data using the x87 instruction set. 
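

The write-mask rule given above for the vector mask registers (bit N set means element N of the destination is overwritten, otherwise it is left unchanged) can be sketched in Python as follows. The function name and the list-based register model are assumptions made only for this illustration.

```python
def masked_write(dest, result, mask):
    """Merge `result` into `dest` under a vector mask.

    dest, result: lists of equal length, index 0 = least significant element.
    mask: integer whose bit n controls element n.
    """
    return [r if (mask >> n) & 1 else d
            for n, (d, r) in enumerate(zip(dest, result))]


old = [0] * 16                    # previous destination contents
new = list(range(100, 116))       # result of some 16-element operation
k1 = 0b0000_0000_0000_0101        # only elements 0 and 2 are enabled

out = masked_write(old, new, k1)
print(out)
```

Only elements 0 and 2 take the new values; every other element keeps its previous contents.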


Bit fields    Field       Description

31-22         Reserved    Reserved bits
21            DUE         Disable Unmasked Exceptions (always set to 1)
20-16         Reserved    Reserved bits
15            FZ          Flush To Zero
14-13         RC          Rounding Control
12-7          Reserved    Reserved bits (IM/DM/ZM/OM/UM/PM in other proliferations)
6             DAZ         Denormals Are Zeros
5             PE          Precision Flag
4             UE          Underflow Flag
3             OE          Overflow Flag
2             ZE          Divide-by-Zero Flag
1             DE          Denormal Operation Flag
0             IE          Invalid Operation Flag


Table 2.15: MXCSR bit layout. Note: MXCSR bit 20 is reserved; however, it is not reported as Reserved by
MXCSR_MASK. Setting this bit will result in undefined behavior.
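

As an informal aid to reading the MXCSR layout, the sketch below extracts the named fields from a 32-bit value. The position of DUE (bit 21) comes from this section; the positions used for FZ, RC, DAZ and the six flag bits are the conventional x86 MXCSR positions and are an assumption here, as are the function name and the sample value.

```python
FLAG_NAMES = ("IE", "DE", "ZE", "OE", "UE", "PE")  # assumed bits 0 to 5


def decode_mxcsr(value):
    """Return the MXCSR fields of interest as a dictionary."""
    return {
        "DUE": (value >> 21) & 1,      # Disable Unmasked Exceptions
        "FZ": (value >> 15) & 1,       # Flush To Zero
        "RC": (value >> 13) & 0b11,    # Rounding Control (2 bits)
        "DAZ": (value >> 6) & 1,       # Denormals Are Zeros
        "flags": {name: (value >> bit) & 1
                  for bit, name in enumerate(FLAG_NAMES)},
    }


# Hypothetical value: DUE set, RC = 11b, DAZ set, IE flag raised.
fields = decode_mxcsr(0x00206041)
print(fields)
```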
Chapter 3


Knights Corner Instruction Format 


This chapter describes the instruction encoding format and assembly instruction syntax of new instructions 
supported by Knights Corner. 


3.1 Overview 


Knights Corner introduces 512-bit vector instructions operating on 512-bit vector registers (zmm0-zmm31),
and offers vector mask registers (k0-k7) to support a rich set of conditional operations on data elements within
the zmm registers. Vector instructions operating on zmm registers are encoded using a multi-byte prefix encoding
scheme, with 62H being the first byte of the multi-byte prefix. This multi-byte prefix is referred to as MVEX in
this document.


Instructions operating on the vector mask registers are encoded using another multi-byte prefix, with C4H or
C5H being the first byte of the multi-byte prefix. This multi-byte prefix is similar to the VEX prefix that is defined
in the "Intel® Architecture Instruction Set Architecture Programming Reference". We will refer to the C4H/C5H based
VEX-like prefix as "VEX" in this document. Additionally, Knights Corner also provides a handful of new instructions
that operate on general-purpose registers but are encoded using VEX. In some cases, new scalar instructions
supported by Knights Corner can be encoded with either MVEX or VEX.


3.2 Instruction Formats 


Instructions encoded by MVEX have the format shown in Figure 3.1. 


Instructions encoded by VEX have the format shown in Figure 3.2. 


3.2.1 MVEX/VEX and the LOCK prefix 


Any MVEX-encoded or VEX-encoded instruction with a LOCK prefix preceding the multi-byte prefix will generate 
an invalid opcode exception (#UD). 


Figure 3.1: New Instruction Encoding Format with MVEX Prefix


Figure 3.2: New Instruction Encoding Format with VEX Prefix


3.2.2 MVEX/VEX and the 66H, F2H, and F3H prefixes 


Any MVEX-encoded or VEX-encoded instruction with a 66H, F2H, or F3H prefix preceding the multi-byte prefix 
will generate an invalid opcode exception (#UD). 


3.2.3 MVEX/VEX and the REX prefix 


Any MVEX-encoded or VEX-encoded instruction with a REX prefix preceding the multi-byte prefix will generate 
an invalid opcode exception (#UD). 


3.3 The MVEX Prefix


The MVEX prefix consists of four bytes, the first of which must be 62H. An MVEX-encoded instruction supports
up to three operands in its syntax and operates on vectors in vector registers or memory, using a vector mask
register to control the conditional processing of individual data elements in a vector. Swizzling, conversion and
other operations on data elements within a vector can be encoded with bit fields in the MVEX prefix, as shown
in Figure 3.3. The functionality of these bit fields is summarized below:


• 64 bit mode register specifier encoding (R, X, B, R', W, V') for memory and vector register operands
(encoded in 1's complement form).


- A vector register as source or destination operand is encoded by combining the R'R bits with the reg
field, or the XB bits with the r/m field of the modR/M byte.


- The base of a memory operand is a general purpose register encoded by combining the B bit with the
r/m field. The index of a memory operand is a general purpose register encoded by combining the X
bit with the SIB.index field.


- The vector index operand in the gather/scatter instruction family is a vector register, encoded by
combining the V'X bits with the SIB.index field. MVEX.vvvv is not used in the gather/scatter
instruction family.


[Bit-layout diagram with columns Byte 0, Byte 1 and Byte 2; bit positions not recoverable]


RXBR': 64-bit mode register specifier associated with reg and r/m encoding in 1's
complement form.

mmmm:
    0001: implied 0F leading opcode byte
    0010: implied 0F 38 leading opcode bytes
    0011: implied 0F 3A leading opcode bytes
    0100-1111: Reserved for future use (will #UD)

W: Opcode extension or 64-bit osize (operand size) promotion.

V'vvvv: A non-destructive register specifier (in 1's complement form) or 11111 if unused.

pp: Compaction of 66/F2/F3 prefix
    00: None
    01: 66
    10: F3
    11: F2

E: Non-temporal/eviction hint.

SSS: Swizzle/broadcast/up-convert/down-convert/static-rounding controls.

kkk: Vector mask register for masking control.


Figure 3.3: MVEX bitfields 


• Non-destructive source register specifier (applicable to the three operand syntax): This is the first source
operand in the three-operand instruction syntax. It is represented by the notation MVEX.vvvv. It can
encode any of the lower 16 zmm vector registers or, using the low 3 bits, a vector mask register
as a source operand. It can be combined with V' to encode any of the 32 zmm vector registers.


• Vector mask register and masking control: The MVEX.aaa field encodes a vector mask register that is
used in controlling the conditional processing operation on the data elements of a 512-bit vector instruction.
The MVEX.aaa field does not encode a source or a destination operand. When the encoded value of
MVEX.aaa is 000b, this corresponds to "no vector mask register will act as conditional mask for the vector
instruction".


• Non-temporal/eviction hint: The MVEX.E field can encode a hint to the processor on a memory referencing
instruction that the data is non-temporal and can be prioritized for eviction. When an instruction encoding
does not reference any memory operand, this bit may also be used to control the function of the MVEX.SSS
field.


• Compaction of legacy prefixes (66H, F2H, F3H): This is encoded in the MVEX.pp field.
• Compaction of two-byte and three-byte opcode: This is encoded in the MVEX.mmmm field.


• Register swizzle/memory conversion operations (broadcast/up-convert/down-convert)/static-rounding
override: This is encoded in the MVEX.SSS field.


- Swizzle operation is supported only for the register-register syntax of 512-bit vector instructions, and
requires MVEX.E = 0. The encoding of MVEX.SSS determines the exact swizzle operation (see Section 2.2).


- Static rounding override only applies to the register-register syntax of vector floating-point instructions,
and requires MVEX.E = 1.


The MVEX prefix is required to be the last prefix and immediately precedes the opcode bytes. 
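

To make the register-specifier bullets above more concrete, the following Python sketch combines two prefix bits stored in 1's complement form with a 3-bit ModR/M field to form a 5-bit vector register number. It assumes that the first bit named in the text (R' in "R'R", X in "XB") is the most significant one; that ordering, like the function name, is an assumption of this sketch rather than something stated in this section.

```python
def vector_register_number(high_stored, low_stored, field3):
    """Form a register number 0..31 from two inverted prefix bits and a 3-bit field.

    high_stored, low_stored: bits exactly as stored in the prefix (inverted).
    field3: the 3-bit reg or r/m field of the ModR/M byte.
    """
    high = (~high_stored) & 1   # undo the 1's complement storage
    low = (~low_stored) & 1
    return (high << 4) | (low << 3) | (field3 & 0b111)


print(vector_register_number(1, 1, 0))   # stored 1,1 means both logical bits are 0
print(vector_register_number(0, 0, 7))   # stored 0,0 means both logical bits are 1
print(vector_register_number(1, 0, 2))
```

Because the bits are stored inverted, an all-ones prefix pattern selects the low registers, which matches the "11111 if unused" convention for V'vvvv in Figure 3.3.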


3.3.1 Vector SIB (VSIB) Memory Addressing 


In the gather/scatter instruction family, an SIB byte that follows the ModR/M byte can support VSIB memory 
addressing to an array of linear addresses. VSIB memory addressing is supported only with the MVEX prefix. 


In VSIB memory addressing, the SIB byte consists of: 


• The scale field (bits 7:6), which specifies the scale factor.


• The index field (bits 5:3), which is prepended with the 2-bit logical value of the MVEX.V'X bits to specify
the vector register number of the vector index operand; each element in the vector register specifies an
index.


• The base field (bits 2:0), which is prepended with the logical value of the MVEX.B field to specify the
register number of the base register.
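

The "array of linear addresses" produced by VSIB addressing can be sketched as one address per element of the vector index register, computed as BASE + VINDEX[i] x SCALE. In the Python sketch below, the scale factor is assumed to follow the conventional SIB encoding (1, 2, 4 or 8 for field values 0 to 3), displacement is left out, and the function name is hypothetical.

```python
def vsib_addresses(base, vindex, scale_field):
    """Return one address per index element: base + index * scale.

    base: value of the base general-purpose register.
    vindex: list of index elements from the vector index register.
    scale_field: 2-bit SIB scale field (assumed to mean 1 << scale_field).
    """
    scale = 1 << scale_field
    return [base + idx * scale for idx in vindex]


addrs = vsib_addresses(0x1000, [0, 1, 2, 3], 2)   # scale factor 4
print([hex(a) for a in addrs])
```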


3.4 The VEX Prefix 


The VEX prefix is encoded in either the two-byte form (the first byte must be C5H) or in the three-byte form (the 
first byte must be C4H). Beyond the first byte, the VEX prefix consists of a number of bit fields providing specific 
capability; they are shown in Figure 3.4. 

The functionality of the bit fields is summarized below: 


• 64 bit mode register specifier encoding (R, X, B, W): The R/X/B bit field is combined with the lower three
bits of register operand encoding in the modR/M byte to access the upper half of the 16 registers available
in 64 bit mode. The VEX.R, VEX.X, VEX.B fields replace the functionality of REX.R, REX.X, REX.B bit fields.
The W bit either replaces the functionality of REX.W or serves as an opcode extension bit. The usage of the
VEX.WRXB bits is explained in detail in section 2.2.1.2 of the Intel® 64 and IA-32 Architectures Software
Developer's Manual, Volume 2A. These bits are stored in 1's complement form (bit inverted format).


• Non-destructive source register specifier (applicable to three operand syntax): This is the first source
operand in the instruction syntax. It is represented by the notation VEX.vvvv. It can encode any
general-purpose register or, using only 3 bits, a vector mask register. This field is encoded using 1's
complement form (bit inverted form), i.e. RAX/K0 is encoded as 1111B, and R15 is encoded as 0000B.


• Compaction of legacy prefixes (66H, F2H, F3H): This is encoded in the VEX.pp field.
• Compaction of two-byte and three-byte opcode: This is encoded in the VEX.mmmmm field.


The VEX prefix is required to be the last prefix and immediately precedes the opcode bytes. It must follow any 
other prefixes. If the VEX prefix is present a REX prefix is not supported. 
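

The 1's complement encoding of VEX.vvvv described above amounts to inverting the 4-bit register number. A minimal Python sketch, with hypothetical function names:

```python
def encode_vvvv(reg):
    """Encode a register number 0..15 into the 4-bit VEX.vvvv field."""
    if not 0 <= reg <= 15:
        raise ValueError("register number must be in 0..15")
    return (~reg) & 0b1111


def decode_vvvv(field):
    """Recover the register number from a 4-bit VEX.vvvv field."""
    return (~field) & 0b1111


print(format(encode_vvvv(0), "04b"))    # RAX / k0
print(format(encode_vvvv(15), "04b"))   # R15
```

The two printed values correspond to the 1111B and 0000B examples given in the text, and the same inversion is its own inverse, so decoding uses the identical operation.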


[Bit-layout diagram with columns Byte 0, Byte 1 and Byte 2; bit positions not recoverable]


RXB: 64-bit mode register specifier associated with reg and r/m operand encoding.

mmmmm:
    00000: Reserved for future use (will #UD)
    00001: implied 0F leading opcode byte
    00010: implied 0F 38 leading opcode bytes
    00011: implied 0F 3A leading opcode bytes
    00100-11111: Reserved for future use (will #UD)

W: Opcode extension or 64-bit osize (operand size) promotion.

vvvv: A non-destructive register specifier (in 1's complement form) or 1111 if unused.

pp: Compaction of 66/F2/F3 prefix
    00: None
    01: 66
    10: F3
    11: F2


Figure 3.4: VEX bitfields 
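

The pp and mmmmm fields listed for Figure 3.4 can be thought of as lookups that stand in for a legacy prefix and for the leading opcode bytes. The Python sketch below expresses that reading; the table and function names are hypothetical, and reserved encodings are reported with an exception that stands in for #UD.

```python
# pp field -> legacy prefix byte it replaces (None means no prefix).
PP_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}

# mmmmm field -> implied leading opcode bytes; all other values are reserved.
LEADING_OPCODE_BYTES = {
    0b00001: (0x0F,),
    0b00010: (0x0F, 0x38),
    0b00011: (0x0F, 0x3A),
}


def expand_vex_fields(pp, mmmmm):
    """Return (implied legacy prefix, implied leading opcode bytes)."""
    if mmmmm not in LEADING_OPCODE_BYTES:
        raise ValueError("reserved mmmmm encoding (#UD)")
    return PP_PREFIX[pp & 0b11], LEADING_OPCODE_BYTES[mmmmm]


prefix, leading = expand_vex_fields(0b01, 0b00010)
print(hex(prefix), [hex(b) for b in leading])
```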


3.5 Knights Corner instructions Assembly Syntax


Knights Corner instructions support up to three operands. The rich encoding fields for
swizzle/broadcast/convert/rounding, masking control, and non-temporal hint are expressed as modifier expressions
to the respective operands in the assembly syntax. A few common forms for Knights Corner assembly instruction
syntax are expressed in the general form:


mnemonic vreg{masking modifier}, source1, transform_modifier(vreg/mem)
mnemonic vreg{masking modifier}, source1, transform_modifier(vreg/mem), imm
mnemonic mem{masking modifier}, transform_modifier(vreg)


The specific forms to express assembly syntax operands, modifiers, and transformations are listed in Table 3.1. 


3.6 Notation 


The notation used to describe the operation of each instruction is given as a sequence of control and assignment
statements in C-like syntax. This document only contains the notation specifically needed for vector instructions.
For convenience, standard Intel® 64 notation may be found in the IA-32 Intel® Architecture Software Developer's
Manual: Volume 2.


When instructions are represented symbolically, the following notations are used: 

label: mnemonic argument1 {write-mask}, argument2, argument3, argument4, ...


where: 


• A mnemonic is a reserved name for a class of instruction opcodes which have the same function.


• The operands argument1, argument2, argument3, argument4, and so on are optional. There may be from
one to three register operands, depending on the opcode. The leftmost operand is always the destination;
for certain instructions, such as vfmadd231ps, it may be a source as well. When the second leftmost
operand is a vector mask register, it may in certain cases be a destination as well, as for example with the
vpsubrsetbd instruction. All other register operands are sources. There may also be additional arguments
in the form of immediate operands; for example, the vcvtfxpntdq2ps instruction has a 3-bit immediate
field that specifies the exponent adjustment to be performed, if any. The write-mask operand specifies the
vector mask register used to control the selective updating of elements in the destination register or
registers.


3.6.1 Operand Notation 


In this manual we will consider vector registers from several perspectives. One perspective is as an array of
64 bytes. Another is as an array of 16 doubleword elements. Another is as an array of 8 quadword elements. Yet
another is as an array of 512 bits. In the mnemonic operation description pseudo-code, registers will be
addressed using bit ranges, such as:


i = n*32
zmm1[i+31:i]


This example refers to the 32 bits of the n-th doubleword element of vector register zmm1.


We will use a similar bit-oriented notation to describe access to vector mask registers. In the case of vector mask 
registers, we will usually specify a single bit, rather than a range of bits, because vector mask registers are used 
for predication, carry, borrow, and comparison results, and a single bit per element is enough for any of those 
purposes. 


Using this notation, it is for example possible to test the value of the 12th bit in k1 as follows:


if ( k1[11] == 1) { ... code here ... } 
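

The bit-range notation can be mimicked in Python by treating a register as a single large integer. The helper below is a sketch for reading the pseudo-code only; `bit_range` is a hypothetical name, and the test values for zmm1 and k1 are made up for the example.

```python
def bit_range(value, msb, lsb):
    """Return value[msb:lsb], i.e. the bits from lsb up to and including msb."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


# Model zmm1 as a 512-bit integer whose n-th doubleword holds n + 1.
zmm1 = sum((n + 1) << (32 * n) for n in range(16))

n = 3
i = n * 32
element = bit_range(zmm1, i + 31, i)   # zmm1[i+31:i]

# Model k1 as a 16-bit integer with only bit 11 (the 12th bit) set.
k1 = 1 << 11
twelfth_bit = bit_range(k1, 11, 11)    # k1[11]

print(element, twelfth_bit)
```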


Tables 3.1 and 3.2 summarize the notation used for instruction operands and their values. 


In Knights Corner instructions, the contents of vector registers are variously interpreted as floating-point values 
(either 32 or 64 bits), integer values, or simply doubleword values of no particular data type, depending on the 
instruction semantics. 


3.6.2 The Displacement Bytes 


Knights Corner introduces a brand new displacement representation that allows for a more compact encoding 
in unrolled code: compressed displacement of 8-bits, or disp8*N. Such compressed displacement is based on 
the assumption that the effective displacement is a multiple of the granularity of the memory access, and hence 
we do not need to encode the redundant low-order bits of the address offset. 


Knights Corner instructions using the MVEX prefix (i.e. using encoding 62) have the following displacement
options:

• No displacement


• 32 bit displacement: this displacement works exactly the same as the legacy 32 bit displacement and
works at byte granularity


• Compressed 8 bit displacement (disp8*N): this displacement format substitutes the legacy 8-bit
displacement in Knights Corner instructions using map 62. This displacement assumes the same granularity as
the memory operand size (which is dependent on the instruction and the memory conversion function
being used). Redundant low-order bits are ignored and hence 8-bit displacements are reinterpreted so
that they are multiplied by the memory operand's total size in order to generate the final displacement to
be used in calculating the effective address.


Note that the displacements in the MVEX vector instruction prefix are encoded in exactly the same way as regular 
displacements (so there are no changes in the ModRM/SIB encoding rules), with the only exception that disp8 is 
overloaded to disp8*N. In other words there are no changes in the encoding rules or encoding lengths, but only 
in the interpretation of the displacement value by hardware (which needs to scale the displacement by the size 
of the memory operand to obtain a byte-wise address offset). 
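

The scaling step described above can be sketched in a few lines of Python. The sketch assumes that the 8-bit displacement is sign-extended before scaling, as the legacy disp8 is; that assumption, the function names, and the sample values of N are not taken from this section and are used for illustration only.

```python
def scaled_displacement(disp8_byte, n):
    """Byte offset produced by a compressed displacement.

    disp8_byte: the raw displacement byte (0..255) from the instruction.
    n: size in bytes of the memory access (the N in disp8*N).
    """
    signed = disp8_byte - 0x100 if disp8_byte & 0x80 else disp8_byte
    return signed * n


def fits_disp8xN(byte_offset, n):
    """True if byte_offset can be encoded as disp8*N instead of disp32."""
    return byte_offset % n == 0 and -128 <= byte_offset // n <= 127


print(scaled_displacement(0x02, 64))   # two full 64-byte vectors forward
print(scaled_displacement(0xFF, 64))   # one full vector backward
print(fits_disp8xN(128, 64), fits_disp8xN(100, 64))
```

The second helper shows why the scheme suits unrolled code: offsets that are whole multiples of the access size fit in a single byte over a much larger range than a plain disp8 would allow.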


3.6.3 Memory size and disp8*N calculation


Table 3.3 and Table 3.4 show the size of the vector (or element) being accessed in memory, which is equal to the 
scaling factor for compressed displacement (disp8*N). Note that some instructions work at element granularity 
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Notation Meaning 

zmm1 A vector register operand in the argument! field of the instruction. The 64 byte 
vector registers are: zmm0 through zmm31 

zmm2 A vector register operand in the argumentz field of the instruction. The 64 byte 
vector registers are: zmm0 through zmm31 

zmm3 A vector register operand in the arguments3 field of the instruction. The 64 byte 
vector registers are: zmm0 through zmm31 

Sf32(zmm/m) A vector floating-point 32 bit swizzle/conversion. Refer to Table 2.2 for register
sources and Table 2.4 for memory conversions.

Sf64(zmm/m) A vector floating-point 64 bit swizzle/conversion. Refer to Table 2.3 for register
sources and Table 2.6 for memory conversions.

Si32(zmm/m) A vector integer 32 bit swizzle/conversion. Refer to Table 2.2 for register sources
and Table 2.5 for memory conversions.

Si64(zmm/m) A vector integer 64 bit swizzle/conversion. Refer to Table 2.3 for register sources
and Table 2.7 for memory conversions.

Uf32(m) A floating-point 32 bit load Up-conversion. Refer to Table 2.9 for the memory
conversions available for all the different datatypes.

Ui32(m) An integer 32 bit load Up-conversion. Refer to Table 2.9 for the memory
conversions available for all the different datatypes.

Uf64(m) A floating-point 64 bit load Up-conversion. Refer to Table 2.11 for the memory
conversions available for all the different datatypes.

Ui64(m) An integer 64 bit load Up-conversion. Refer to Table 2.11 for the memory
conversions available for all the different datatypes.

Df32(zmm) A floating-point 32 bit store Down-conversion. Refer to Table 2.12 for the memory
conversions available for all the different datatypes.

Di32(zmm) An integer 32 bit store Down-conversion. Refer to Table 2.12 for the memory
conversions available for all the different datatypes.

Df64(zmm) A floating-point 64 bit store Down-conversion. Refer to Table 2.13 for the memory
conversions available for all the different datatypes.

Di64(zmm) An integer 64 bit store Down-conversion. Refer to Table 2.13 for the memory
conversions available for all the different datatypes.

m A memory operand.

mt A memory operand that may have an EH hint attribute.

mvt A vector memory operand that may have an EH hint attribute. This memory
operand is encoded using ModRM and VSIB bytes. It can be seen as a set of pointers
where each pointer is equal to BASE + VINDEX[i] × SCALE.

effective_address Used to denote the full effective address when dealing with a memory operand.

imm8 An immediate byte value.

SRC[a-b] A bit-field from an operand ranging from LSB b to MSB a.
Table 3.1: Operand Notation 

Notation        Meaning

zmm1[i+31:i]    The value of the element located between bit i and bit i + 31 of the argument1
                vector operand.
zmm2[i+31:i]    The value of the element located between bit i and bit i + 31 of the argument2
                vector operand.
k1[i]           Specifies the i-th bit in the vector mask register k1.


Table 3.2: Vector Operand Value Notation 
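
The notation in Tables 3.1 and 3.2 can be illustrated with a short sketch. It assumes a vector register is modeled as a single Python integer holding 512 bits; the helper names (bit_field, element32, mask_bit) are hypothetical and are not part of the manual's pseudo-code.

```python
def bit_field(value: int, msb: int, lsb: int) -> int:
    """Return the bit-field value[msb:lsb] as an unsigned integer."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def element32(zmm: int, n: int) -> int:
    """Return zmm[i+31:i] for the n-th 32 bit element, where i = 32 * n."""
    i = 32 * n
    return bit_field(zmm, i + 31, i)


def mask_bit(k: int, n: int) -> int:
    """Return k[n], the n-th bit of a vector mask register."""
    return (k >> n) & 1


# Build a 512-bit value whose n-th 32 bit element holds n + 1.
zmm1 = sum((n + 1) << (32 * n) for n in range(16))
```

With this model, element32(zmm1, 3) yields the fourth element, and mask_bit(k1, n) decides whether that element takes part in a masked operation.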
instead of full vector granularity at memory level, and hence should use the "element level" column in Table 3.3
and Table 3.4 (namely VLOADUNPACK, VPACKSTORE, VGATHER, and VSCATTER instructions).
Table 3.3: Size of vector or element accessed in memory for up-conversion

Function   Usage                      Memory accessed / Disp8*N

U/Sf32                                No broadcast   4to16 broadcast   1to16 broadcast
                                                                       or element level
000        [rax] {16to16} or [rax]    64             16                4
001        [rax] {1to16}              4              NA                NA
010        [rax] {4to16}              16             NA                NA
011        [rax] {float16}            32             8                 2
100        [rax] {uint8}              16             4                 1
101        [rax] {sint8}              16             4                 1
110        [rax] {uint16}             32             8                 2
111        [rax] {sint16}             32             8                 2

U/Si32                                No broadcast   4to16 broadcast   1to16 broadcast
                                                                       or element level
000        [rax] {16to16} or [rax]    64             16                4
001        [rax] {1to16}              4              NA                NA
010        [rax] {4to16}              16             NA                NA
011        N/A                        NA             NA                NA
100        [rax] {uint8}              16             4                 1
101        [rax] {sint8}              16             4                 1
110        [rax] {uint16}             32             8                 2
111        [rax] {sint16}             32             8                 2

U/Sf64                                No broadcast   4to8 broadcast    1to8 broadcast
                                                                       or element level
000        [rax] {8to8} or [rax]      64             32                8
001        [rax] {1to8}               8              NA                NA
010        [rax] {4to8}               32             NA                NA
011        N/A                        NA             NA                NA
100        N/A                        NA             NA                NA
101        N/A                        NA             NA                NA
110        N/A                        NA             NA                NA
111        N/A                        NA             NA                NA

U/Si64                                No broadcast   4to8 broadcast    1to8 broadcast
                                                                       or element level
000        [rax] {8to8} or [rax]      64             32                8
001        [rax] {1to8}               8              NA                NA
010        [rax] {4to8}               32             NA                NA
011        N/A                        NA             NA                NA
100        N/A                        NA             NA                NA
101        N/A                        NA             NA                NA
110        N/A                        NA             NA                NA
111        N/A                        NA             NA                NA
Table 3.4: Size of vector or element accessed in memory for down-conversion

Function   Usage              Memory accessed / Disp8*N

Df32                          Regular store   Element level
000        zmm1               64              4
001        N/A                NA              NA
010        N/A                NA              NA
011        zmm1 {float16}     32              2
100        zmm1 {uint8}       16              1
101        zmm1 {sint8}       16              1
110        zmm1 {uint16}      32              2
111        zmm1 {sint16}      32              2

Df64                          Regular store   Element level
000        zmm1               64              8
001        N/A                NA              NA
010        N/A                NA              NA
011        N/A                NA              NA
100        N/A                NA              NA
101        N/A                NA              NA
110        N/A                NA              NA
111        N/A                NA              NA

Di64                          Regular store   Element level
000        zmm1               64              8
001        N/A                NA              NA
010        N/A                NA              NA
011        N/A                NA              NA
100        N/A                NA              NA
101        N/A                NA              NA
110        N/A                NA              NA
111        N/A                NA              NA


3.7 EH hint 


All vector instructions that access memory provide the option of specifying a cache-line eviction hint, EH. 


EH is a performance hint, and may operate in different ways or even be completely ignored in different hardware 
implementations. Knights Corner is designed to provide support for cache-efficient access to memory locations 
that have either low temporal locality of access or bursts of a few very closely bunched accesses. 


There are two distinct modes of EH hint operation, one for prefetching and one for loads, stores, and load-op
instructions.


The interaction of the EH hint with prefetching is summarized in Table 3.5. 


EH value   | Hit behavior  | Miss behavior
EH not set | Make data MRU | Fetch data and make it MRU
EH set     | Make data MRU | Fetch data into way #N, where N is the thread number, and make it MRU


Table 3.5: Prefetch behavior based on the EH (cache-line eviction hint) 


The above table describes the effect of the EH bit on gather/scatter prefetches into the targeted cache (e.g. L1
for vgatherpf0dps, L2 for vgatherpf1dps). If vgatherpf0dps misses both L1 and L2, the resulting prefetch into L1
is a non-temporal prefetch into way #N of L1, but the prefetch into L2 is a normal prefetch, not a non-temporal
prefetch. If you want the data to be non-temporally fetched into L2, you must use vgatherpf1dps with the EH bit
set.


The operation of the EH hint with prefetching is designed to limit the cache impact of streaming data. 


Note that regular prefetch instructions (like vprefetch0) do not have an embedded EH hint. Instead, the
non-temporal hint is given by the opcode/mnemonic (see VPREFETCHNTA/0/1/2 descriptions for details). The
same rules described in Table 3.5 still apply.


Table 3.6 summarizes the interaction of the EH hint with load and load-op instructions. 


EH value   | L1 hit behavior | L1 miss behavior
EH not set | Make data MRU   | Fetch data and make it MRU
EH set     | Make data LRU   | Fetch data and make it MRU


Table 3.6: Load/load-op behavior based on the EH bit. 


The EH bit, when used with load and load-op instructions, affects only the L1 cache behavior. Any resulting L2 
misses are handled normally, regardless of the setting of the EH bit. 


Table 3.7 summarizes the interaction of the EH hint with store instructions. Note that stores that write a full 
cache-line (no mask, no down-conversion) evict the line from L1 (invalidation) while updating the contents 
directly into the L2 cache. In any other case, a store with an EH hint works as a load with an EH hint. 


EH value   | Store type            | L1 hit behavior          | L1 miss behavior
EH not set |                       | Make data MRU            | Fetch data and make it MRU
EH set     | No mask, no downconv. | Invalidate L1, update L2 | Fetch data and make it MRU
EH set     | Mask or downconv.     | Make data LRU            | Fetch data and make it MRU


Table 3.7: Store behavior based on the EH bit. 


The EH bit, when used with load and load-op instructions, affects only the L1 cache behavior. Any resulting L2 
misses are handled normally, regardless of the setting of the EH bit. 
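
Tables 3.6 and 3.7 can be read as a single decision function. The following is a minimal sketch of that reading, not a model of the hardware; the function name l1_action and its parameters are hypothetical, and the returned strings simply echo the table cells.

```python
def l1_action(eh: bool, l1_hit: bool, is_store: bool = False,
              full_line_store: bool = False) -> str:
    """Return the L1 behavior listed in Tables 3.6 and 3.7.

    full_line_store means a store with no mask and no down-conversion,
    i.e. one that writes a whole cache line.
    """
    if not l1_hit:
        # Every row of both tables has the same miss behavior.
        return "fetch data and make it MRU"
    if not eh:
        return "make data MRU"
    if is_store and full_line_store:
        return "invalidate L1, update L2"
    # Loads, load-ops, and masked or down-converting stores with EH set.
    return "make data LRU"
```

For example, l1_action(eh=True, l1_hit=True, is_store=True, full_line_store=True) corresponds to the second row of Table 3.7.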
3.8 Functions and Tables Used


Some mnemonic definitions use auxiliary tables and functions to ease the process of describing the operations of 
the instruction. The following section describes those tables and functions that do not have an obvious meaning. 


3.8.1 MemLoad and MemStore


This document uses two functions, MemLoad and MemStore, to describe in pseudo-code memory transfers that
involve no conversions or broadcasts:


• MemLoad: Given an address pointer, this function returns the associated data from memory. Size is
defined by the explicit destination size in the pseudo-code (see for example LDMXCSR in Appendix B).


• MemStore: Given an address pointer, this function stores the associated data to memory. Size is defined
by the explicit source data size in the pseudo-code.


3.8.2 SwizzUpConvLoad, UpConvLoad and DownConvStore


In this document, the detailed discussions of memory-accessing instructions that support datatype conversion 
and/or broadcast (as defined by the UpConv, SwizzUpConv, and DownConv tables in section 2.2) use the func- 
tions shown in Table 3.8 in their Operation sections (the instruction pseudo-code). These functions are used 
to describe any swizzle, broadcast, and/or conversion that can be performed by the instruction, as well as the 
actual load in the case of SwizzUpConv and UpConv. Note that zmm/m means that the source may be either a 
vector operand or a memory operand, depending on the ModR/M encoding. 


Swizzle/conversion used   Function used in operation description
Sf32(zmm/m)               SwizzUpConvLoadf32(zmm/m)
Sf64(zmm/m)               SwizzUpConvLoadf64(zmm/m)
Si32(zmm/m)               SwizzUpConvLoadi32(zmm/m)
Si64(zmm/m)               SwizzUpConvLoadi64(zmm/m)
Uf32(m)                   UpConvLoadf32(m)
Ui32(m)                   UpConvLoadi32(m)
Uf64(m)                   UpConvLoadf64(m)
Ui64(m)                   UpConvLoadi64(m)
Df32(zmm)                 DownConvStoref32(zmm) or DownConvStoref32(zmm[xx:yy])
Di32(zmm)                 DownConvStorei32(zmm) or DownConvStorei32(zmm[xx:yy])
Df64(zmm)                 DownConvStoref64(zmm) or DownConvStoref64(zmm[xx:yy])
Di64(zmm)                 DownConvStorei64(zmm) or DownConvStorei64(zmm[xx:yy])


Table 3.8: SwizzUpConv, UpConv and DownConv function conventions 


The Operation section may use UpConvSizeOf, which returns the final size (in bytes) of an up-converted memory
element given a specified up-conversion mode. A specific subset of a memory stream may be used as a parameter
for UpConvLoad; the size of the subset is inferred from the size of the destination together with the
up-conversion mode.


Additionally, the Operation section may also use DownConvStoreSizeOf, which returns the final size (in bytes) of
a down-converted vector element given a specified down-conversion mode. A specific subset of a vector register
may be used as a parameter for DownConvStore; for example, DownConvStore(zmm2[31:0]) specifies that the
low 32 bits of zmm2 form the parameter for DownConvStore.
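
As an illustration of what DownConvStoreSizeOf computes, the sketch below transcribes the "element level" column of Table 3.4 for the 32 bit floating-point down-conversions. The dictionary keys and the Python function names are hypothetical stand-ins chosen for this example; only the byte counts come from the table.

```python
# Bytes written per element for each Df32 mode ("element level" column, Table 3.4).
DOWNCONV_F32_ELEMENT_BYTES = {
    "none": 4,
    "float16": 2,
    "uint8": 1,
    "sint8": 1,
    "uint16": 2,
    "sint16": 2,
}


def down_conv_store_size_of(mode: str) -> int:
    """Return the size in bytes of one down-converted element."""
    return DOWNCONV_F32_ELEMENT_BYTES[mode]


def regular_store_bytes(mode: str, n_elements: int = 16) -> int:
    """Return the bytes written by a regular store of n_elements elements."""
    return n_elements * down_conv_store_size_of(mode)
```

Multiplying the element size by 16 reproduces the "regular store" column of Table 3.4 (64, 32 or 16 bytes).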


3.8.3 Other Functions/Identifiers 


The following identifiers are used in the algorithmic descriptions: 


Carry - The carry bit from an addition. 


FpMaxAbs - The greater of the absolute values of two floating-point numbers. See the description of the 
VGMAXABSPS instruction for further details. 


FpMax - The greater of two floating-point numbers. See the description of the VGMAXPS instruction for
further details.


FpMin - The lesser of two floating-point numbers. See the description of the VGMINPS instruction for 
further details. 


Abs - The absolute value of a number. 


IMax - The greater of two signed integer numbers. 


UMax - The greater of two unsigned integer numbers. 


IMin - The lesser of two signed integer numbers. 


UMin - The lesser of two unsigned integer numbers. 


CvtInt32ToFloat32 - Convert a signed 32 bit integer number to a 32 bit floating-point number. 


CvtInt32ToFloat64 - Convert a signed 32 bit integer number to a 64 bit floating-point number.


CvtFloat32ToInt32 - Convert a 32 bit floating-point number to a 32 bit signed integer number using the 
specified rounding mode. 


CvtFloat64ToInt32 - Convert a 64 bit floating-point number to a 32 bit signed integer number using the 
specified rounding mode. 


CvtFloat32ToUint32 - Convert a 32 bit floating-point number to a 32 bit unsigned integer number using 
the specified rounding mode. 


CvtFloat64ToUint32 - Convert a 64 bit floating-point number to a 32 bit unsigned integer number using 
the specified rounding mode. 


CvtFloat32ToFloat64 - Convert a 32 bit floating-point number to a 64 bit floating-point number. 


CvtFloat64ToFloat32 - Convert a 64 bit floating-point number to a 32 bit floating-point number using 
the specified rounding mode. 


CvtUint32ToFloat32 - Convert an unsigned 32 bit integer number to a 32 bit floating-point number. 


CvtUint32ToFloat64 - Convert an unsigned 32 bit integer number to a 64 bit floating-point number. 


Exp2 - Performs a table lookup to obtain the floating-point value of Exp2 for a 5-bit fixed point number in 
the range [0, 1). See the description of the VEXP2LUTPS instruction for further details. 
GetExp - Obtains the (un-biased) exponent of a given floating-point number, returned in the form of a 32
bit floating-point number. See the description of the VGETEXPPS instruction for further details.


Log2 - Performs a table lookup to obtain the floating-point value of the Log2 of the 6 most significant bits 
of the mantissa of a 32 bit floating point number. See the description of the VLOG2LUTPS instruction for 
further details. 


Rcp - Performs a table lookup to obtain an approximation of the reciprocal of a 32 bit floating-point num- 
ber. See the description of the VRCPREFINEPS instruction for further details. 


RoundToInt - Rounds a floating-point number to the nearest integer, using the specified rounding mode.
The result is a floating-point representation of the rounded integer value. 


RSqrt - Performs a table lookup to obtain an approximation of the reciprocal square root of a 32 bit 
floating-point number. See the description of the VRSQRTLUTPS instruction for further details. 


Borrow - The borrow bit from a subtraction. 
ZeroExtend - Returns a value zero-extended to the operand-size attribute of the instruction. 
FlushL1CacheLine - Flushes the cache line containing the specified memory address from L1. 


InvalidateCacheLine - Invalidate the cache line containing the specified memory address from the whole 
memory cache hierarchy. 


FetchL1CacheLine - Prefetches the cache line containing the specified memory address into L1. See the 
description of the VPREFETCH1 instruction for further details. 


FetchL2CacheLine - Prefetches the cache line containing the specified memory address into L2. See the 
description of the VPREFETCH2 instruction for further details. 
Chapter 4


Floating-Point Environment, Memory Addressing, and Processor State


This chapter describes the Knights Corner vector floating-point instruction exception behavior and interactions 
related to system programming. 


4.1 Overview 


Knights Corner 512-bit vector instructions that operate on floating-point data may signal exceptions related to 
arithmetic processing. When SIMD floating-point exceptions occur, Knights Corner supports exception report- 
ing using exception flags in the MXCSR register, but traps (unmasked exceptions) are not supported. 


Exceptions caused by memory accesses apply to vector floating-point, vector integer, and scalar instructions. 


The MXCSR register (see Figure 4.1) in Knights Corner provides: 


• Exception flags to indicate SIMD floating-point exceptions signaled by floating-point instructions
operating on zmm registers. The flags are: IE, DE, ZE, OE, UE, PE.


• Rounding behavior and control: DAZ, FZ and RC.


• Exception Suppression: DUE (always 1)


4.1.1 Suppress All Exceptions Attribute (SAE) 


Knights Corner instructions that process floating-point data support a specific feature to disable floating-point 
exception signaling, called SAE ("suppress all exceptions"). The SAE mode is enabled via a specific bit in the 
register swizzle field of the MVEX prefix (by setting the EH bit to 1). When SAE is enabled in the instruction 
encoding, that instruction does not report any SIMD floating-point exception in the MXCSR register. This feature 
is only available to the register-register format of the instructions and in combination with static rounding-mode. 
[Figure: MXCSR bit-field diagram. Fields shown: Reserved, Disable Unmasked Exceptions, Flush to Zero,
Rounding Control, Denormals Are Zeros, Precision Flag, Underflow Flag, Overflow Flag, Divide-by-Zero Flag,
Denormal Flag, Invalid Operation Flag.]


Figure 4.1: MXCSR Control/Status Register 


4.1.2 SIMD Floating-Point Exceptions 


SIMD floating-point exceptions are those exceptions that can be generated by Knights Corner instructions that 
operate on floating-point data in zmm operands. Six classes of SIMD floating-point exception flags can be sig- 
naled: 


• Invalid operation (#I)

• Divide-by-zero (#Z)

• Numeric overflow (#O)

• Numeric underflow (#U)

• Inexact result (Precision) (#P)

• Denormal operand (#D)


4.1.3 SIMD Floating-Point Exception Conditions 


The following sections describe the conditions that cause SIMD floating-point exceptions to be signaled, and the 
masked response of the processor when these conditions are detected. 


When more than one exception is encountered, then the following precedence rules are applied¹.


1. Invalid-operation exception caused by sNaN operand

2. Any other invalid exception condition different from sNaN input operand

3. Denormal operand exception

4. A divide-by-zero exception

5. Overflow/underflow exception

6. Inexact result


¹ Note that Knights Corner has no support for unmasked exceptions, so in this case the exception precedence
rules have no effect. All concurrently-encountered exceptions will be reported simultaneously.


All Knights Corner floating-point exceptions are precise and are reported as soon as the instruction
completes execution. The status flags that each instruction sets in the MXCSR register are the logical OR of
the flags set by each of the up to 16 (or 8) individual operations. The status flags are sticky and can be cleared
only via an LDMXCSR instruction.
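
The sticky, OR-accumulated flag update can be sketched as follows. The bit positions assume the standard MXCSR layout with the six status flags in bits 0 to 5 (the text below confirms IE as bit 0); update_mxcsr and its arguments are hypothetical names used only for this illustration.

```python
# Status flag bits, assuming the standard MXCSR layout (IE = bit 0 ... PE = bit 5).
IE, DE, ZE, OE, UE, PE = (1 << n for n in range(6))
FLAG_MASK = 0x3F


def update_mxcsr(mxcsr: int, element_flags: list) -> int:
    """OR the flags raised by each per-element operation into MXCSR.

    Flags are sticky: bits already set in mxcsr are never cleared here.
    """
    raised = 0
    for flags in element_flags:
        raised |= flags & FLAG_MASK
    return mxcsr | raised


# One element was inexact, another overflowed and was inexact, the rest were clean.
after_first = update_mxcsr(0, [PE, OE | PE] + [0] * 14)
# A later instruction that raises nothing leaves the flags untouched.
after_second = update_mxcsr(after_first, [0] * 16)
```

Clearing the flags is deliberately absent from the sketch, since only LDMXCSR can do that.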


4.1.3.1 Invalid Operation Exception (#I)


The floating-point invalid-operation exception (#I) occurs in response to an invalid arithmetic operand. The flag
(IE) and mask (IM) bits for the invalid operation exception are bits 0 and 7, respectively, in the MXCSR register.


Knights Corner instructions force all floating-point exceptions, including invalid-operation exceptions, to be
masked. Thus, for the #I exception the value returned in the destination register is a QNaN, QNaN Indefinite,
Integer Indefinite, or one of the source operands. When a value is returned to the destination operand, it
overwrites the destination register specified by the instruction. Table 4.1 lists the invalid-arithmetic operations
that the processor detects for instructions and the masked responses to these operations.


Normally, when one or more of the source operands are QNaNs (and neither is an SNaN or in an unsupported
format), an invalid-operation exception is not generated. For VCMPPS and VCMPPD when the predicate is one
of lt, le, nlt, or nle, a QNaN source operand does generate an invalid-operation exception.


Note that invalid-operation exceptions (like all other floating-point exceptions) are always masked in Knights
Corner.


4.1.3.2 Divide-By-Zero Exception (#Z) 


The processor reports a divide-by-zero exception when a VRCP23PS instruction has a 0 operand. 


Note that divide-by-zero exceptions (like all other floating-point exceptions) are always masked in Knights Cor- 
ner. 


4.1.3.3 Denormal Operand Exception (#D)


The processor reports a denormal operand exception when an arithmetic instruction attempts to operate on a
denormal operand and the DAZ bit in the MXCSR (the "Denormals Are Zero" bit) is set to 0 (so that denormal
operands are not treated as zeros).


Note that denormal exceptions (like all other floating-point exceptions) are always masked in Knights Corner. 
Condition | Masked Response
VADDNPD, VADDNPS, VADDPD, VADDPS, VADDSETSPS, VMULPD, VMULPS, VRCP23PS, VRSQRT23PS, VLOG2PS, VSCALEPS, VSUBPD, VSUBPS, VSUBRPD or a VSUBRPS instruction with an SNaN operand | Return the SNaN converted to a QNaN. For more detailed information refer to Table 4.3
VCMPPD or VCMPPS with QNaN or SNaN operand | Return 0 (except for the predicates not-equal, unordered, not-less-than, or not-less-than-or-equal, which return a 1)
VCVTPD2PS or VCVTPS2PD instruction with an SNaN operand | Return the SNaN converted to a QNaN.
VCVTFXPNTPD2DQ, VCVTFXPNTPD2UDQ, VCVTFXPNTPS2DQ, or VCVTFXPNTPS2UDQ instruction with a NaN operand | Return a 0.
VGATHERD, VMOVAPS, VLOADUNPACKHPS, VLOADUNPACKLPS, or VBROADCASTSS instruction with SNaN operand and selected UpConv32 that converts from floating-point to another floating-point data type | Return the SNaN converted to a QNaN.
VPACKSTOREHPS, VPACKSTORELPS, VSCATTERDPS, or VMOVAPS instruction with SNaN operand and a selected DownConv32 that converts from float to another float datatype | Return the SNaN converted to a QNaN.
VFMADD132PD, VFMADD132PS, VFMADD213PD, VFMADD213PS, VFMADD231PD, VFMADD233PS, VFNMSUB132PD, VFNMSUB132PS, VFNMSUB213PD, VFNMSUB213PS, VFNMSUB231PD, VFNMSUB231PS, VFMSUB132PD, VFMSUB132PS, VFMSUB213PD, VFMSUB213PS, VFMSUB231PD, VFMSUB231PS, VFNMADD132PD, VFNMADD132PS, VFNMADD213PD, VFNMADD213PS, VFNMADD231PD, or VFNMADD231PS instruction with an SNaN operand | Follow rules described in Table 4.4.
VGMAXPD, VGMAXPS, VGMINPD or VGMINPS instruction with SNaN operand | Returns non NaN operand. If both operands are NaN, return first source NaN.
VGMAXABSPS instruction with SNaN operand | Returns non NaN operand. If both operands are NaN, return first source NaN with its sign bit cleared.
Multiplication of infinity by zero | Return the QNaN floating-point Indefinite.
VGETEXPPS, VRCP23PS, VRSQRT23PS or VRNDFXPNTPS instruction with SNaN operand | Return the SNaN converted to a QNaN.
VRSQRT23PS instruction with NaN or negative value | Return the QNaN floating-point Indefinite.
Addition of opposite signed infinities or subtraction of like-signed infinities | Return the QNaN floating-point Indefinite.


Table 4.1: Masked Responses of Knights Corner instructions to Invalid Arithmetic Operations 


4.1.3.4 Numeric Overflow Exception (#O)


The processor reports a numeric overflow exception whenever the rounded result of an arithmetic instruction 
exceeds the largest allowable finite value that fits in the destination operand. 


Note that overflow exceptions (like all other floating-point exceptions) are always masked in Knights Corner. 
4.1.3.5 Numeric Underflow Exception (#U)


The processor signals an underflow exception whenever (a) the rounded result of an arithmetic instruction, 
calculated assuming unbounded exponent, is less than the smallest possible normalized finite value that will fit 
in the destination operand (the result is tiny), and (b) the final rounded result, calculated with bounded exponent 
determined by the destination format, is inexact. 


Note that underflow exceptions (like all other floating-point exceptions) are always masked in Knights Corner. 


The flush-to-zero control bit provides an additional option for handling numeric underflow exceptions in 
Knights Corner. If set (FZ = 1), tiny results (these are usually, but not always, denormal values) are replaced 
by zeros of the same sign. If not set (FZ=0) then tiny results will be rounded to 0, a denormalized value, or the 
smallest normalized floating-point number in the destination format, with the sign of the exact result. 
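
The effect of the FZ bit on a tiny float32 result can be sketched as below. This is a simplification under stated assumptions: the value passed in stands for the result before flushing, whereas the processor decides tininess on the result rounded with an unbounded exponent. The function name apply_flush_to_zero is hypothetical.

```python
import math

# Smallest positive normalized float32 value, 2**-126.
F32_MIN_NORMAL = 2.0 ** -126


def apply_flush_to_zero(result: float, fz: bool) -> float:
    """Replace a tiny (nonzero, below the normal range) result by a same-signed zero when FZ = 1."""
    tiny = result != 0.0 and abs(result) < F32_MIN_NORMAL
    if fz and tiny:
        return math.copysign(0.0, result)
    return result
```

With FZ = 0 the value is returned unchanged here; the rounding to 0, a denormal, or the smallest normal described above is not modeled.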


4.1.3.6 Inexact Result (Precision) Exception (#P) 


The inexact-result exception (also called the precision exception) occurs if the result of an operation is not exactly 
representable in the destination format. For example, the fraction 1/3 cannot be precisely represented in binary 
form. This exception occurs frequently and indicates that some (normally acceptable) accuracy has been lost. 
The exception is supported for applications that need to perform exact arithmetic only. In flush-to-zero mode, 
the inexact result exception is signaled for any tiny result. (By definition, tiny results are not zero, and are flushed 
to zero when MXCSR.FZ = 1 for all instructions that support this mode.) 


Note that inexact exceptions (like all other floating-point exceptions) are always masked in Knights Corner. 


4.2 Denormal Flushing Control 


4.2.1 Denormal control in up-conversions and down-conversions 


Instruction up-conversions and down-conversions follow specific denormal flushing rules, i.e. for treating input 
denormals as zeros and for flushing tiny results to zero: 


4.2.1.1 Up-conversions 


• Up-conversions from float16 to float32 ignore the MXCSR.DAZ setting and thus never treat input denormals
as zeros. Denormal exceptions are never signaled (the MXCSR.DE flag is never set by these operations).


• Up-conversions from any small floating-point number (namely, float16) to float32 can never generate a
float32 output denormal.


4.2.1.2 Down-conversions 


• Down-conversions from float32 to float16 follow the MXCSR.DAZ setting to decide whether to treat input
denormals as zeros or not. For input denormals, the MXCSR.DE flag is set only if MXCSR.DAZ is not set,
otherwise it is left unchanged.


• Down-conversions from float32 to any integer format follow the MXCSR.DAZ setting to decide whether to
treat input denormals as zeros or not (this may matter only in directed rounding modes). The MXCSR.DE
status flag is never set.


• Down-conversions from float32 to any small floating-point number ignore MXCSR.FZ and always preserve
output denormals.


4.3 Extended Addressing Displacements 


Address displacements used by memory operands of Knights Corner vector instructions, as well as by
MVEX-encoded versions of VPREFETCH and CLEVICT, operate differently than normal x86 displacements.
Knights Corner 8-bit displacements (i.e. when ModRM.mod=01) are reinterpreted so that they are
multiplied by the memory operand's total size in order to generate the final displacement to be used in
calculating the effective address (32 bit displacements, which vector instructions may also use, operate normally,
in the same way as for normal x86 instructions). Note that extended 8-bit displacements are still signed integer
numbers and need to be sign extended.


A given vector instruction's 8-bit displacement is always multiplied by the total number of bytes of memory 
the instruction accesses, which can mean multiplication by 64, 32, 16, 8, 4, 2 or 1, depending on any broadcast 
and/or data conversion in effect. Thus when reading a 64-byte (no conversion, no broadcast) source operand, 
for example via 


vmovaps zmm0, [rsi]


the encoded 8-bit displacement is first multiplied by 64 (shifted left by 6) before being used in the effective 
address calculation. For 


vbroadcastss zmm0, [rsi]{uint16} // {1to16} broadcast of {uint16} data


however, the encoded displacement would be multiplied by 2. Note that for MVEX versions of VPREFETCH and 
CLEVICT, we always use disp8*64; for VEX versions we use the standard x86 disp8 displacement. 
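
The disp8*N computation described above can be sketched as follows. The scale factors are transcribed from the U/Sf32 rows of Table 3.3; the dictionary layout and the function names (sign_extend8, scaled_displacement) are hypothetical and serve only to illustrate the arithmetic.

```python
# N for 32 bit floating-point memory operands, keyed by (conversion, broadcast),
# transcribed from the U/Sf32 section of Table 3.3.
DISP8_N_F32 = {
    ("none", "none"): 64, ("none", "4to16"): 16, ("none", "1to16"): 4,
    ("float16", "none"): 32, ("float16", "4to16"): 8, ("float16", "1to16"): 2,
    ("uint8", "none"): 16, ("uint8", "4to16"): 4, ("uint8", "1to16"): 1,
    ("sint8", "none"): 16, ("sint8", "4to16"): 4, ("sint8", "1to16"): 1,
    ("uint16", "none"): 32, ("uint16", "4to16"): 8, ("uint16", "1to16"): 2,
    ("sint16", "none"): 32, ("sint16", "4to16"): 8, ("sint16", "1to16"): 2,
}


def sign_extend8(byte: int) -> int:
    """Interpret an encoded displacement byte as a signed 8-bit integer."""
    byte &= 0xFF
    return byte - 256 if byte & 0x80 else byte


def scaled_displacement(disp8: int, n: int) -> int:
    """Return the final displacement: the sign-extended disp8 multiplied by N."""
    return sign_extend8(disp8) * n
```

For the vmovaps example, an encoded byte of 0x01 yields a displacement of 64 bytes; for the vbroadcastss {uint16} example, N is 2, so 0x01 yields 2 bytes.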


The use of disp8*N makes it possible to avoid using 32 bit displacements with vector instructions most of the
time, thereby reducing code size and shrinking the required size of the paired-instruction decode window by
3 bytes. Disp8*N overcomes the limitations of a plain disp8, which is simply too small to access enough vector
operands to be useful (only 4 64-byte operands). Moreover, although disp8*N can only generate displacements
that are multiples of N, that is not a significant limitation, since the memory operands of Knights Corner
instructions must already be aligned to the total number of bytes of memory the instruction accesses in order to
avoid raising a #GP fault, and that alignment is exactly what disp8*N results in, given aligned base+index
addressing.


4.4 Swizzle/up-conversion exceptions 


There is a set of Knights Corner instructions that do not accept all regular forms of memory up-conversion/register
swizzling and raise a #UD fault for illegal combinations. The instructions are:


• VALIGND
• VCVTDQ2PD
• VCVTPS2PD
• VCVTUDQ2PD
• VEXP223PS
• VFMADD233PS
• VLOG2PS
• VPERMD
• VPERMF32X4
• VPMADD233D
• VPSHUFD
• VRCP23PS
• VRSQRT23PS


Table 4.2 summarizes which up-conversion/swizzling primitives are allowed for each of those instructions:
Mnemonic      | None | {1to16} | {4to16} | Register swizzles | Memory conversions
VALIGND       | yes  | no      | no      | no                | no
VCVTDQ2PD     | yes  | yes     | yes     | yes               | no
VCVTPS2PD     | yes  | yes     | yes     | yes               | no
VCVTUDQ2PD    | yes  | yes     | yes     | yes               | no
VEXP223PS     | yes  | no      | no      | no                | no
VFMADD233PS   | yes  | no      | yes     | no                | no
VLOG2PS       | yes  | no      | no      | no                | no
VPERMD        | yes  | no      | no      | no                | no
VPERMF32X4    | yes  | no      | no      | no                | no
VPMADD233D    | yes  | no      | yes     | no                | no
VPSHUFD       | yes  | no      | no      | no                | no
VRCP23PS      | yes  | no      | no      | no                | no
VRSQRT23PS    | yes  | no      | no      | no                | no


Table 4.2: Summary of legal and illegal swizzle/conversion primitives for special instructions.
4.5 Accessing uncacheable memory


When accessing uncacheable memory, it is important to define the amount of data that is really accessed when
using Knights Corner instructions (mainly when Knights Corner instructions are used to access
memory-mapped I/O regions). Depending on the memory region accessed, an access may cause a mapped
device to behave differently.


Knights Corner instructions, when accessing uncacheable memory, can be categorized in four different
groups:

• regular memory read operations

• vloadunpackh*/vloadunpackl*

• vgatherd*

• memory store operations


4.5.1 Memory read operations 


Any Knights Corner instruction that reads from memory, apart from vloadunpackh*/vloadunpackl* and
vgatherd*, accesses as many consecutive bytes as dictated by the combination of memory SwizzUpConv modifiers.


4.5.2 vloadunpackh*/vloadunpackl*


vloadunpackh*/vloadunpackl* instructions are exceptions to the general rule. Those two instructions will
always access 64 bytes of memory. The memory region accessed is between effective_address & (~0x3F) and
(effective_address & (~0x3F)) + 63 in both cases.
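
A minimal sketch of the region computation, assuming the mask in the expression above is ~0x3F (that is, the address rounded down to a 64-byte boundary). The function name accessed_region is hypothetical.

```python
def accessed_region(effective_address: int) -> tuple:
    """Return the (first, last) byte addresses of the 64-byte region that is accessed."""
    first = effective_address & ~0x3F  # clear the low 6 bits
    return first, first + 63
```

For example, an effective address of 0x1234 falls in the region 0x1200 to 0x123F, so the whole of that line is touched even though the address is not line-aligned.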


4.5.3 vgatherd* 


vgatherd* instructions are able to gather up to 16 32 bit elements. The number of elements accessed is
determined by the number of bits set in the vector mask provided as source. A vgatherd* instruction will access
up to 16 different 64-byte memory regions when gathering the elements. Note that, depending on the
implementation, only one 64-byte memory access may be performed for a variable number of vector elements
located in that region.


Each accessed region will be between element_effective_address & (~0x3F) and (element_effective_address &
(~0x3F)) + 63.
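
The set of 64-byte regions a gather may touch can be sketched as follows. The sketch assumes each element address is BASE + VINDEX[i] × SCALE, as in the vector memory operand notation of Table 3.1, and that the region mask is ~0x3F; gather_regions is a hypothetical name, and whether the hardware merges accesses to the same region is implementation dependent.

```python
def gather_regions(base: int, vindex: list, scale: int, mask: int) -> list:
    """Return the sorted start addresses of the distinct 64-byte regions
    touched by the elements whose mask bit is set."""
    regions = set()
    for i, index in enumerate(vindex):
        if (mask >> i) & 1:
            element_address = base + index * scale
            regions.add(element_address & ~0x3F)
    return sorted(regions)
```

With base 0x1000, indices [0, 1, 16, 100], scale 4 and mask 0b1011, elements 0 and 1 share the region at 0x1000, element 2 is masked off, and element 3 falls in the region at 0x1180.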


4.5.4 Memory stores 


All Knights Corner instructions that perform memory store operations update only those memory positions
determined by the vector mask operand. The vector mask specifies which elements will actually be stored in
memory. The DownConv* modifier determines the number of bytes per element that will be modified in memory.
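
The bytes modified by a masked store can be sketched as below. This assumes a regular (non-packing) store in which element i is written at offset i times the down-converted element size; that placement is an assumption of the sketch, and stored_byte_ranges is a hypothetical name.

```python
def stored_byte_ranges(address: int, mask: int, bytes_per_element: int,
                       n_elements: int = 16) -> list:
    """Return (first, last) byte addresses written for each element whose mask bit is set."""
    ranges = []
    for i in range(n_elements):
        if (mask >> i) & 1:
            start = address + i * bytes_per_element
            ranges.append((start, start + bytes_per_element - 1))
    return ranges
```

For a {float16} down-conversion (2 bytes per element) at address 0x2000 with mask 0b0101, only bytes 0x2000 to 0x2001 and 0x2004 to 0x2005 are modified.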
4.6 Floating-point Notes


4.6.1 Rounding Modes 


VRNDFXPNTPS and conversion instructions with float32 sources, such as VCVTFXPNTPS2DQ, support four se- 
lectable rounding modes: round to nearest (even), round toward negative infinity (round down), round toward 
positive infinity (round up), and round toward zero. These are the standard IEEE rounding modes; see IA-32
Intel® Architecture Software Developer's Manual: Volume 1, Section 4.8.4, for details. 


Knights Corner introduces general support for all four rounding-modes mandated for binary floating-point 
arithmetic by the IEEE Standard 754-2008. 


4.6.1.1 Swizzle-explicit rounding modes 


Knights Corner introduces the option of specifying the rounding-mode per instruction via a specific regis- 
ter swizzle mode (by setting the EH bit to 1). This specific rounding-mode takes precedence over whatever 
MXCSR.RC specifies. 


For those instructions (like VRNDFXPNTPS) where an explicit rounding-mode is specified via immediate, this 
immediate takes precedence over a swizzle-explicit rounding-mode embedded into the encoding of the instruc- 
tion. 


The priority of the rounding-modes of an instruction hence becomes (from highest to lowest): 


1. Rounding mode specified in the instruction immediate (if any) 
2. Rounding mode specified in the instruction swizzle attribute


3. Rounding mode specified in RC bits of the MXCSR 
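
The priority order above amounts to a simple fallback chain, sketched here. The function name and parameters are hypothetical; None stands for "this source of rounding mode is not present in the instruction".

```python
def effective_rounding_mode(mxcsr_rc: str, swizzle_rc: str = None,
                            imm_rc: str = None) -> str:
    """Pick the rounding mode by priority: immediate, then swizzle attribute, then MXCSR.RC."""
    if imm_rc is not None:
        return imm_rc
    if swizzle_rc is not None:
        return swizzle_rc
    return mxcsr_rc
```

For instance, an instruction with a swizzle-explicit "round toward zero" uses that mode regardless of MXCSR.RC, unless it also carries a rounding-mode immediate.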


4.6.1.2 Definition and propagation of NaNs 


The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). Quiet
NaNs have 1 as their first fraction bit, SNaNs have 0 as their first fraction bit. An SNaN is quieted by setting its
first fraction bit to 1. The class of a NaN (quiet or signaling) is preserved when converting between different
precisions.
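
For 32 bit values, the first fraction bit is bit 22 of the encoding, so classification and quieting reduce to a few bit operations. The sketch below works on raw 32 bit encodings; the helper names are hypothetical.

```python
F32_EXP_MASK = 0x7F800000
F32_FRAC_MASK = 0x007FFFFF
F32_QUIET_BIT = 0x00400000  # first (most significant) fraction bit


def is_nan32(bits: int) -> bool:
    """A NaN has an all-ones exponent and a nonzero fraction."""
    return (bits & F32_EXP_MASK) == F32_EXP_MASK and (bits & F32_FRAC_MASK) != 0


def is_snan32(bits: int) -> bool:
    """A signaling NaN is a NaN whose first fraction bit is 0."""
    return is_nan32(bits) and (bits & F32_QUIET_BIT) == 0


def quiet32(bits: int) -> int:
    """Quiet a NaN by setting its first fraction bit; other values are unchanged."""
    return bits | F32_QUIET_BIT if is_nan32(bits) else bits
```

Quieting preserves the sign and the remaining fraction bits, so the SNaN 0x7F800001 becomes the QNaN 0x7FC00001.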


The processor never generates an SNaN as a result of a floating-point operation with no SNaN operands, so 
SNaNs must be present in the input data or have to be inserted by the software. 


QNaNs are allowed to propagate through most arithmetic operations without signaling an exception. Note also 
that Knights Corner instructions do not trap for arithmetic exceptions, as floating-point exceptions are always 
masked. 


If any operation has one or more NaN operands then the result, in most cases, is a QNaN that is one of the input 
NaNs, quieted if it is an SNaN. This is chosen as the first NaN encountered when scanning the operands from left 
to right, as presented in the instruction descriptions from Chapter 6. 


If any floating-point operation with operands that are not NaNs leads to an indefinite result (e.g. 0/0, 0 × ∞, or
∞ - ∞), the result will be QNaN Indefinite: 0xFFC00000 for 32 bit operations and 0xFFF8000000000000 for
64 bit operations.


When operating on NaNs, if the instruction does not define any other behavior, Table 4.3 describes the NaN 
behavior for unary and binary instructions. Table 4.4 shows the NaN behavior for ternary fused multiply and 
add/sub operations. This table can be derived by considering the operation as a concatenation of two binary op- 
erations. The first binary operation, the multiply, produces the product. The second operation uses the product 
as the first operand for the addition. 


Source operands                     Result

SNaN                                SNaN source operand, converted into a QNaN
QNaN                                QNaN source operand
SNaN and QNaN                       First operand (if this operand is an SNaN, it is
                                    converted to a QNaN)
Two SNaNs                           First operand converted to a QNaN
Two QNaNs                           First operand
SNaN and a floating-point value     SNaN source operand, converted into a QNaN
QNaN and a floating-point value     QNaN source operand


Table 4.3: Rules for handling NaNs for unary and binary operations.
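

The rules of Table 4.3 for a binary float32 operation reduce to "take the first NaN operand, quieted". The sketch below models only that selection on raw bit patterns; the names is_nan32 and propagate_nan2 are hypothetical, and instructions that define their own NaN behavior are not covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* NaN test on a raw float32 pattern: all-ones exponent, non-zero fraction. */
bool is_nan32(uint32_t bits)
{
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}

/*
 * Model of Table 4.3 for two operands scanned left to right.
 * Returns true and stores the propagated QNaN in *result when either
 * operand is a NaN; returns false when neither is, in which case the
 * arithmetic result (not modeled here) applies.
 */
bool propagate_nan2(uint32_t src1, uint32_t src2, uint32_t *result)
{
    const uint32_t quiet_bit = 0x00400000u;  /* first fraction bit */

    if (is_nan32(src1)) {
        *result = src1 | quiet_bit;  /* first operand, quieted if it was an SNaN */
        return true;
    }
    if (is_nan32(src2)) {
        *result = src2 | quiet_bit;
        return true;
    }
    return false;
}
```

With an SNaN 0x7F800001 as the first operand and a QNaN 0x7FC00002 as the second, the model returns 0x7FC00001, matching the "SNaN and QNaN" row.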


4.6.1.3 Signed Zeros 


Zero can be represented as a +0 or a -0 depending on the sign bit. Both encodings are equal in value. The sign
of a zero result depends on the operation being performed and the rounding mode being used.


Knights Corner introduces the fused "multiply and add" and "multiply and sub" operations. These
consist of a multiplication (whose sign is possibly negated) followed by an addition or subtraction, all calculated
with just one rounding error.


The sign of the multiplication result is the exclusive-or of the signs of the multiplier and multiplicand, regardless 
of the rounding mode (a positive number has a sign bit of 0, and a negative one, a sign bit of 1). 


The sign of the addition (or subtraction) result is in general that of the exact result. However, when this result
is exactly zero, special rules apply: when the sum of two operands with opposite signs (or the difference of two
operands with like signs) is exactly zero, that sum (or difference) is +0 in all rounding modes except
round down; in that case, an exact zero sum (or difference) is -0. This is true even if the operands
are zeros, or denormals treated as zeros because MXCSR.DAZ is set to 1. Note that x + x = x - (-x) retains the
same sign as x even when x is zero; in particular, (+0) + (+0) = +0, and (-0) + (-0) = -0, in all rounding
modes.
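

The sign rule for an exactly-zero sum can be written as a small decision function. This is a sketch that models only the rule stated above, operating on sign bits rather than on floating-point values; the names round_mode and exact_zero_sum_sign are hypothetical.

```c
/* Hypothetical names for illustration; enumerator values are not a hardware encoding. */
typedef enum {
    RM_NEAREST_EVEN,
    RM_DOWN,
    RM_UP,
    RM_TOWARD_ZERO
} round_mode;

/*
 * sign_a, sign_b: sign bits (0 = positive, 1 = negative) of two addends
 * whose exact sum is zero. Returns the sign bit of the zero result.
 */
int exact_zero_sum_sign(int sign_a, int sign_b, round_mode rm)
{
    if (sign_a == sign_b)
        return sign_a;            /* (+0) + (+0) = +0 and (-0) + (-0) = -0 in every mode */
    return rm == RM_DOWN ? 1 : 0; /* opposite signs: -0 only when rounding down */
}
```

A subtraction a - b can be fed to the same function by flipping the sign bit of b first.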


When (a × b) + c is exactly zero, the sign of fused multiply-add/subtract shall be determined by the rules above
for a sum of operands. When the exact result of ±(a × b) ± c is non-zero yet the final result is zero because of
rounding, the zero result takes the sign of the exact result.


                               vfmadd231ps     vfmadd132ps     vfmadd213ps     vfmadd233ps*
                               vfmsub231ps     vfnmsub132ps    vfnmsub213ps
                               vfnmadd231ps    vfmsub132ps     vfmsub213ps
                               vfmadd231pd     vfnmadd132ps    vfnmadd213ps
                               vfnmsub231pd    vfmadd132pd     vfmadd213pd
                               vfmsub231pd     vfnmsub132pd    vfnmsub213pd
                               vfnmadd231pd    vfmsub132pd     vfmsub213pd
Src1      Src2      Src3                       vfnmadd132pd    vfnmadd213pd

NaN1      NaN2      NaN3       qNaN2           qNaN1           qNaN2           qNaN2
NaN1      NaN2      value      qNaN2           qNaN1           qNaN2           qNaN2
NaN1      value     NaN3       qNaN3           qNaN1           qNaN1           qNaN3
value     NaN2      NaN3       qNaN2           qNaN3           qNaN2           qNaN2
NaN1      value     value      qNaN1           qNaN1           qNaN1           qNaN1
value     NaN2      value      qNaN2           qNaN2           qNaN2           qNaN2
value     value     NaN3       qNaN3           qNaN3           qNaN3           qNaN3


Table 4.4: Rules for handling NaNs for fused multiply and add/sub operations (ternary).


*The interpretation of the sources is slightly different for this instruction. Here the Src1 column and NaN1 are
associated with Src3[31:0]. Similarly the Src3 column and NaN3 are associated with Src3[63:32].


The result for "fused multiply and add" follows by applying the following algorithm:


• (Xd, Yd, Zd) = DAZ applied to (Src1, Src2, Src3) (denormal operands, if any, are treated as zeros of the same
sign as the operand; other operands are not changed)

• Resultd = Xd × Yd + Zd computed exactly then rounded to the destination precision.

• Result = FTZ applied to Resultd (tiny results are replaced by zeros of the same sign; other results are not
changed).
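

The three steps of the algorithm above (DAZ on the inputs, a multiply-add, FTZ on the result) can be sketched for float32 as follows. This is a model under stated assumptions, not the hardware behavior: all names are hypothetical, DAZ and FTZ are treated as always enabled, the default round-to-nearest mode of the host is used, and the middle step evaluates the expression in double precision as a stand-in for "computed exactly then rounded", which matches a true fused operation only when the double-precision sum is itself exact.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its raw bit pattern and back. */
uint32_t f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float f32_from_bits(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Replace a denormal (zero exponent field, non-zero fraction) by a zero of the same sign. */
float flush_denormal(float f)
{
    uint32_t u = f32_bits(f);
    if ((u & 0x7F800000u) == 0 && (u & 0x007FFFFFu) != 0)
        u &= 0x80000000u;  /* keep only the sign bit */
    return f32_from_bits(u);
}

/* Model of Result = FTZ(round(DAZ(Src1) * DAZ(Src2) + DAZ(Src3))). */
float fma_daz_ftz_model(float src1, float src2, float src3)
{
    float x = flush_denormal(src1);  /* DAZ step */
    float y = flush_denormal(src2);
    float z = flush_denormal(src3);

    /* Stand-in for the exact multiply-add: the product of two floats is exact in double. */
    float r = (float)((double)x * (double)y + (double)z);

    return flush_denormal(r);        /* FTZ step: tiny result becomes a same-signed zero */
}
```

In this model a denormal Src1 multiplied by a large Src2 contributes nothing, because the operand is replaced by zero before the multiplication takes place.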


4.6.2 REX prefix and Knights Corner instructions interactions 


The REX prefix is illegal in combination with Knights Corner vector instructions, or with mask and
scalar instructions allocated using VEX and MVEX prefixes.


Following the Intel® 64 behavior, if the REX prefix is followed by any legacy prefix and is not located just before
the opcode escape, it will be ignored.


4.7 Knights Corner instructions State Save 


Knights Corner does not include any explicit instruction to perform context save and restore of Knights Corner
state. To perform a context save and restore, software may use:


• Vector loads and stores for vector registers
• A combination of kmov plus scalar loads and stores for mask registers
• LDMXCSR/STMXCSR for the MXCSR state register


Note also that vector instructions raise a device-not-available (#NM) exception when CR0.TS is set. This allows
selective lazy save and restore of state.


4.8 Knights Corner instructions Processor State After Reset 


Table 4.5 shows the state of the flags and other registers following power-up for Knights Corner. 



Register                                        Knights Corner

EFLAGS                                          00000002H
EIP                                             0000FFF0H
CR0                                             60000010H
CR2, CR3, CR4                                   00000000H
CS                                              Selector = F000H; Base = FFFF0000H
                                                Limit = FFFFH
                                                AR = Present, R/W, Accessed
SS, DS, ES, FS, GS                              Selector = 0000H; Base = 00000000H
                                                Limit = FFFFH
                                                AR = Present, R/W, Accessed
EDX                                             000005xxH
EAX                                             04
EBX, ECX, ESI, EDI, EBP, ESP                    00000000H
ST0 through ST7                                 Pwr up or Reset: +0.0
                                                FINIT/FNINIT: Unchanged
x87 FPU Control Word                            Pwr up or Reset: 0040H
                                                FINIT/FNINIT: 037FH
x87 FPU Status Word                             Pwr up or Reset: 0000H
                                                FINIT/FNINIT: 0000H
x87 FPU Tag Word                                Pwr up or Reset: 5555H
                                                FINIT/FNINIT: FFFFH
x87 FPU Data Operand and CS Seg. Selectors      Pwr up or Reset: 0000H
                                                FINIT/FNINIT: 0000H
x87 FPU Data Operand and Inst. Pointers         Pwr up or Reset: 00000000H
                                                FINIT/FNINIT: 00000000H
MM0 through MM7                                 NA
XMM0 through XMM7                               NA
k0 through k7                                   0000H
zmm0 through zmm31                              0 (64 bytes)
MXCSR                                           0020_0000H
GDTR, IDTR                                      Base = 00000000H, Limit = FFFFH
                                                AR = Present, R/W
LDTR, Task Register                             Selector = 0000H, Base = 00000000H
                                                Limit = FFFFH
                                                AR = Present, R/W
DR0, DR1, DR2, DR3                              00000000H
DR6                                             FFFF0FF0H
DR7                                             00000400H
Time-Stamp Counter                              Power up or Reset: 0H
                                                INIT: Unchanged
Perf. Counters and Event Select                 Power up or Reset: 0H
                                                INIT: Unchanged
All Other MSRs                                  Power up or Reset: Undefined
                                                INIT: Unchanged
Data and Code Cache, TLBs                       Invalid
MTRRs, Machine-Check                            Not Implemented
APIC                                            Pwr up or Reset: Enabled
                                                INIT: Unchanged


Table 4.5: Processor State Following Power-up, Reset, or INIT. 
Chapter 5


Instruction Set Reference 


Knights Corner instructions that are described in this document follow the general documentation convention 
established in this chapter. 


5.1 Interpreting Instruction Reference Pages 


This section describes the format of information contained in the instruction reference pages in this chapter. It
explains notational conventions and abbreviations used in these sections.


5.1.1 Instruction Format


The following is an example of the format used for each instruction description in this chapter. 


Opcode                           Instruction                                Description

MVEX.NDS.512.66.0F38.W1 50 /r    vaddnpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)     Add float64 vector zmm2 and float64 vector
                                                                            Sf64(zmm3/mt), negate the sum, and store the
                                                                            result in zmm1, under write-mask.

VEX.0F.W0 41 /r                  kand k1, k2                                Perform a bitwise AND between k1 and k2, store
                                                                            result in k1.


5.1.2 Opcode Notations for MVEX Encoded Instructions 


In the Instruction Summary Table, the Opcode column presents the details of each instruction byte encoding 
using notations described in this section. For MVEX encoded instructions, the notations are expressed in the 
following form (including the modR/M byte if applicable, and the immediate byte if applicable): 
MVEX.[NDS,NDD].[512].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] [/ib]


• MVEX: indicates that the presence of the MVEX prefix is required. The MVEX prefix consists of 4 bytes with the
leading byte 62H.
The encoding of various sub-fields of the MVEX prefix is described using the following notations:

  - NDS,NDD: specifies that MVEX.vvvv field is valid for the encoding of a register operand:

    * MVEX.NDS: MVEX.vvvv encodes the first source register in an instruction syntax where the content
      of source registers will be preserved. To encode a vector register in the range zmm16-zmm31,
      the MVEX.vvvv field is pre-pended with MVEX.V'.

    * MVEX.NDD: MVEX.vvvv encodes the destination register that cannot be encoded by ModR/M:reg
      field. To encode a vector register in the range zmm16-zmm31, the MVEX.vvvv field is pre-pended
      with MVEX.V'.

    * If none of NDS, NDD is present, MVEX.vvvv must be 1111b (i.e. MVEX.vvvv does not encode an
      operand).

  - 66,F2,F3: The presence or absence of these values maps to the MVEX.pp field encodings. If absent,
    this corresponds to MVEX.pp=00B. If present, the corresponding MVEX.pp value affects the "opcode"
    byte in the same way as a SIMD prefix (66H, F2H or F3H) does to the ensuing opcode byte. Thus a
    non-zero encoding of MVEX.pp may be considered as an implied 66H/F2H/F3H prefix.

  - 0F,0F3A,0F38: The presence of these values maps to a valid encoding of the MVEX.mmmm field. Only
    three encoded values of MVEX.mmmm are defined as valid, corresponding to the escape byte
    sequence of 0FH, 0F3AH and 0F38H.

  - W0: MVEX.W=0
  - W1: MVEX.W=1

  - The presence of W0/W1 in the opcode column applies to two situations: (a) it is treated as an
    extended opcode bit, (b) the instruction semantics support an operand size promotion to 64 bit of a
    general-purpose register operand or a 32 bit memory operand.

• opcode: Instruction opcode.
• /r: Indicates that the ModR/M byte of the instruction contains a register operand and an r/m operand.
• /vsib: Indicates the memory addressing uses the vector SIB byte.
• ib: A 1-byte immediate operand to the instruction that follows the opcode, ModR/M bytes or scale/indexing
bytes.


In general, the encoding of the MVEX.R, MVEX.X, MVEX.B, and MVEX.V' fields are not shown explicitly in the 
opcode column. The encoding scheme of MVEX.R, MVEX.X, MVEX.B, and MVEX.V' fields must follow the rules 
defined in Chapter 3. 


5.1.3 Opcode Notations for VEX Encoded Instructions


In the Instruction Summary Table, the Opcode column presents the details of each instruction byte encoding 
using notations described in this section. For VEX encoded instructions, the notations are expressed in the fol- 
lowing form (including the modR/M byte if applicable, the immediate byte if applicable): 


VEX.[NDS,NDD].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] [/ib]


• VEX: indicates that the presence of the VEX prefix is required. The VEX prefix can be encoded using the
three-byte form (the first byte is C4H), or using the two-byte form (the first byte is C5H). The two-byte
form of VEX only applies to those instructions that do not require the following fields to be encoded:
VEX.mmmmm, VEX.W, VEX.X, VEX.B. Refer to Chapter 3 for more details on the VEX prefix.

The encoding of various sub-fields of the VEX prefix is described using the following notations:

  - NDS,NDD: specifies that VEX.vvvv field is valid for the encoding of a register operand:

    * VEX.NDS: VEX.vvvv encodes the first source register in an instruction syntax where the content
      of source registers will be preserved.

    * VEX.NDD: VEX.vvvv encodes the destination register that cannot be encoded by ModR/M:reg
      field.

    * If none of NDS, NDD is present, VEX.vvvv must be 1111b (i.e. VEX.vvvv does not encode an
      operand). The VEX.vvvv field can be encoded using either the 2-byte or 3-byte form of the VEX
      prefix.

  - 66,F2,F3: The presence or absence of these values maps to the VEX.pp field encodings. If absent, this
    corresponds to VEX.pp=00B. If present, the corresponding VEX.pp value affects the "opcode" byte in
    the same way as a SIMD prefix (66H, F2H or F3H) does to the ensuing opcode byte. Thus a non-zero
    encoding of VEX.pp may be considered as an implied 66H/F2H/F3H prefix. The VEX.pp field may be
    encoded using either the 2-byte or 3-byte form of the VEX prefix.

  - 0F,0F3A,0F38: The presence of these values maps to a valid encoding of the VEX.mmmmm field. Only
    three encoded values of VEX.mmmmm are defined as valid, corresponding to the escape byte
    sequence of 0FH, 0F3AH and 0F38H. The effect of a valid VEX.mmmmm encoding on the ensuing opcode
    byte is the same as that of the corresponding escape byte sequence on the ensuing opcode byte for
    non-VEX encoded instructions. Thus a valid encoding of VEX.mmmmm may be considered as an implied
    escape byte sequence of either 0FH, 0F3AH or 0F38H. The VEX.mmmmm field must be encoded using the
    3-byte form of VEX prefix.

  - 0F,0F3A,0F38 and 2-byte/3-byte VEX: The presence of 0F3A and 0F38 in the opcode column implies
    that the opcode can only be encoded by the three-byte form of VEX. The presence of 0F in the opcode
    column does not preclude the opcode from being encoded by the two-byte form of VEX if the semantics
    of the opcode do not require any subfield of VEX not present in the two-byte form of the VEX prefix.

  - W0: VEX.W=0
  - W1: VEX.W=1

  - The presence of W0/W1 in the opcode column applies to two situations: (a) it is treated as an
    extended opcode bit, (b) the instruction semantics support an operand size promotion to 64 bit of a
    general-purpose register operand or a 32 bit memory operand. The presence of W1 in the opcode
    column implies the opcode must be encoded using the 3-byte form of the VEX prefix. The presence
    of W0 in the opcode column does not preclude the opcode from being encoded using the C5H form of the
    VEX prefix, if the semantics of the opcode do not require other VEX subfields not present in the
    two-byte form of the VEX prefix.

• opcode: Instruction opcode.
• /r: Indicates that the ModR/M byte of the instruction contains a register operand and an r/m operand.
• ib: A 1-byte immediate operand to the instruction that follows the opcode, ModR/M bytes or scale/indexing
bytes.


In general, the encoding of the VEX.R, VEX.X, and VEX.B fields are not shown explicitly in the opcode col- 
umn. The encoding scheme of VEX.R, VEX.X, and VEX.B fields must follow the rules defined in Chapter 
3. 
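

To make the relationship between the opcode-column notation and the prefix bytes more concrete, the following sketch extracts the mmmmm, W, vvvv, L and pp fields from a two-byte (C5H) or three-byte (C4H) VEX prefix. The bit positions used here are an assumption based on the general Intel VEX prefix definition rather than on this section; Chapter 3 remains the authoritative description for Knights Corner. The struct and function names are hypothetical, and the R, X and B bits are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the VEX sub-fields named in the opcode column. */
typedef struct {
    unsigned length;  /* number of prefix bytes: 2 or 3 */
    unsigned mmmmm;   /* 1 = 0FH, 2 = 0F38H, 3 = 0F3AH (assumed standard VEX mapping) */
    unsigned w;       /* VEX.W; not encodable in the two-byte form, reported as 0 */
    unsigned vvvv;    /* raw field value; 1111b means no operand is encoded */
    unsigned l;       /* vector length bit */
    unsigned pp;      /* 0 = none, 1 = 66H, 2 = F3H, 3 = F2H (assumed standard VEX mapping) */
} vex_fields;

/* Decode the VEX prefix at p[0..n-1]; returns false if no VEX prefix is found. */
bool decode_vex(const uint8_t *p, size_t n, vex_fields *out)
{
    if (n >= 2 && p[0] == 0xC5) {
        out->length = 2;
        out->mmmmm  = 1;                 /* two-byte form implies the 0FH escape */
        out->w      = 0;
        out->vvvv   = (p[1] >> 3) & 0xFu;
        out->l      = (p[1] >> 2) & 0x1u;
        out->pp     = p[1] & 0x3u;
        return true;
    }
    if (n >= 3 && p[0] == 0xC4) {
        out->length = 3;
        out->mmmmm  = p[1] & 0x1Fu;
        out->w      = (p[2] >> 7) & 0x1u;
        out->vvvv   = (p[2] >> 3) & 0xFu;
        out->l      = (p[2] >> 2) & 0x1u;
        out->pp     = p[2] & 0x3u;
        return true;
    }
    return false;
}
```

Under these assumptions, a prefix written as C5 F8 decodes to the implied 0FH escape with pp=00B and vvvv=1111b, which is the shape a "VEX.128.0F.W0" opcode entry with no NDS/NDD operand would take; the byte values are an illustration, not an encoding taken from this manual.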
Chapter 6


Instruction Descriptions 


This chapter defines all of the Knights Corner vector instructions. Note: Some instruction descriptions
refer to the SSS or S2S1S0, which are bits 6-4 from the MVEX prefix encoding. See Table 2.14 for more
details.
JKNZD - Jump near if mask is not zero


Opcode                       Instruction         Description

VEX.NDS.128.0F.W0 85 id      jknzd k1, rel32     Jump near if mask is not zero.
VEX.NDS.128.W0 75 ib         jknzd k1, rel8      Jump near if mask is not zero.


Description


Checks the value of source mask, and if not all mask bits are set to 0, performs a jump to
the target instruction specified by the destination operand. If the condition is not
satisfied, the jump is not performed and execution continues with the instruction following
the JKNZD instruction.


The target instruction is specified with a relative offset (a signed offset relative to the
current value of the instruction pointer in the EIP register). A relative offset (rel8, rel16,
or rel32) is generally specified as a label in assembly code, but at the machine code level,
it is encoded as a signed, 8-bit or 32 bit immediate value, which is added to the instruction
pointer. Instruction coding is most efficient for offsets of -128 to +127. If the operand-size
attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum
instruction pointer size of 16 bits.


The instruction does not support far jumps (jumps to other code segments). When the 
target for the conditional jump is in a different segment, use the opposite condition from 
the condition being tested for the JKNZD instruction, and then access the target with an 
unconditional far jump (JMP instruction) to the other segment. For example, the following 
conditional far jump is illegal: 


JKNZD FARLABEL; 
To accomplish this far jump, use the following two instructions: 


JKZD BEYOND; 
JMP FARLABEL; 
BEYOND: 


This conditional jump is converted to code fetch of one or two cache lines, regardless of 
jump address or cacheability. 


In 64 bit mode, operand size (OSIZE) is fixed at 64 bits. JMP Short is RIP = RIP + 8-bit 
offset sign extended to 64 bits. JMP Near is RIP = RIP + 32 bit offset sign extended to 64 
bits. 
Operation


if (k1[15:0] != 0)
{
    tempEIP = EIP + SignExtend(DEST);

    if (OSIZE == 16)
    {
        tempEIP = tempEIP & 0000FFFFH;
    }

    if (*tempEIP is not within code segment limit*)
    {
        #GP(0);
    }
    else
    {
        EIP = tempEIP
    }
}
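

As a rough illustration of the operation above in 64 bit mode, the following sketch computes the next instruction pointer for a mask-conditional near jump. It is a simplified model: the function names are hypothetical, the instruction pointer passed in is assumed to be the address of the instruction following the jump, and the 16-bit operand-size masking and the #GP check are omitted. The complementary JKZD condition is included for comparison.

```c
#include <stdint.h>

/*
 * k1:       16-bit mask value tested by the jump
 * next_rip: assumed to be the address of the instruction after the jump
 * rel:      signed displacement (rel8 values are sign-extended by the caller)
 */
uint64_t jknzd_next_rip(uint16_t k1, uint64_t next_rip, int32_t rel)
{
    if (k1 != 0)                                  /* at least one mask bit set: jump taken */
        return next_rip + (uint64_t)(int64_t)rel; /* sign-extend the offset to 64 bits */
    return next_rip;                              /* fall through */
}

uint64_t jkzd_next_rip(uint16_t k1, uint64_t next_rip, int32_t rel)
{
    if (k1 == 0)                                  /* all mask bits clear: jump taken */
        return next_rip + (uint64_t)(int64_t)rel;
    return next_rip;
}
```

For any given mask value exactly one of the two functions takes the branch, which is what makes the inverted-condition idiom for far targets described earlier work.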


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


None. 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#GP(0)    If the memory address is in a non-canonical form.
#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
JKZD - Jump near if mask is zero


Opcode                       Instruction         Description

VEX.NDS.128.0F.W0 84 id      jkzd k1, rel32      Jump near if mask is zero.
VEX.NDS.128.W0 74 ib         jkzd k1, rel8       Jump near if mask is zero.


Description


Checks the value of source mask, and if all mask bits are set to 0, performs a jump to the
target instruction specified by the destination operand. If the condition is not satisfied,
the jump is not performed and execution continues with the instruction following the
JKZD instruction.


The target instruction is specified with a relative offset (a signed offset relative to the
current value of the instruction pointer in the EIP register). A relative offset (rel8, rel16,
or rel32) is generally specified as a label in assembly code, but at the machine code level,
it is encoded as a signed, 8-bit or 32 bit immediate value, which is added to the instruction
pointer. Instruction coding is most efficient for offsets of -128 to +127. If the operand-size
attribute is 16, the upper two bytes of the EIP register are cleared, resulting in a maximum
instruction pointer size of 16 bits.


The instruction does not support far jumps (jumps to other code segments). When the
target for the conditional jump is in a different segment, use the opposite condition from
the condition being tested for the JKZD instruction, and then access the target with an
unconditional far jump (JMP instruction) to the other segment. For example, the following
conditional far jump is illegal:


JKZD FARLABEL; 
To accomplish this far jump, use the following two instructions: 


JKNZD BEYOND; 
JMP FARLABEL; 
BEYOND: 


This conditional jump is converted to code fetch of one or two cache lines, regardless of 
jump address or cacheability. 


In 64 bit mode, operand size (OSIZE) is fixed at 64 bits. JMP Short is RIP = RIP + 8-bit 
offset sign extended to 64 bits. JMP Near is RIP = RIP + 32 bit offset sign extended to 64 
bits. 
Operation


if (k1[15:0] == 0)
{
    tempEIP = EIP + SignExtend(DEST);

    if (OSIZE == 16)
    {
        tempEIP = tempEIP & 0000FFFFH;
    }

    if (*tempEIP is not within code segment limit*)
    {
        #GP(0);
    }
    else
    {
        EIP = tempEIP
    }
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


None. 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#GP(0)    If the memory address is in a non-canonical form.
#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KAND - AND Vector Mask


Opcode                  Instruction      Description

VEX.128.0F.W0 41 /r     kand k1, k2      Perform a bitwise AND between vector masks
                                         k1 and k2 and store the result in vector mask
                                         k1.


Description


Performs a bitwise AND between the vector mask k2 and the vector mask k1, and writes
the result into vector mask k1.


Operation


for (n = 0; n < 16; n++) {
    k1[n] = k1[n] & k2[n]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kand (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KANDN - AND NOT Vector Mask


Opcode                  Instruction      Description

VEX.128.0F.W0 42 /r     kandn k1, k2     Perform a bitwise AND between NOT (vector
                                         mask k1) and vector mask k2 and store the
                                         result in vector mask k1.


Description 


Performs a bitwise AND between vector mask k2, and the NOT (bitwise logical negation) 
of vector mask k1, and writes the result into vector mask k1. 


Operation


for (n = 0; n < 16; n++) {
    k1[n] = (~(k1[n])) & k2[n]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kandn (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KANDNR - Reverse AND NOT Vector Mask


Opcode                  Instruction       Description

VEX.128.0F.W0 43 /r     kandnr k1, k2     Perform a bitwise AND between NOT (vector
                                          mask k2) and vector mask k1 and store the
                                          result in vector mask k1.


Description


Performs a bitwise AND between the NOT (bitwise logical negation) of vector mask k2, 
and the vector mask k1, and writes the result into vector mask k1. 


Operation


for (n = 0; n < 16; n++) {
    k1[n] = ~(k2[n]) & k1[n]
}
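

KANDN and KANDNR differ only in which operand is negated, which is easy to mix up. The sketch below models KAND, KANDN and KANDNR on plain 16-bit integers; the typedef and function names are hypothetical stand-ins and are not the compiler intrinsics.

```c
#include <stdint.h>

typedef uint16_t mask16;  /* stand-in for a 16-bit vector mask register */

/* k1 = k1 & k2 */
mask16 kand_model(mask16 k1, mask16 k2)
{
    return (mask16)(k1 & k2);
}

/* k1 = ~k1 & k2 : the destination operand is negated */
mask16 kandn_model(mask16 k1, mask16 k2)
{
    return (mask16)(~k1 & k2);
}

/* k1 = ~k2 & k1 : the source operand is negated */
mask16 kandnr_model(mask16 k1, mask16 k2)
{
    return (mask16)(~k2 & k1);
}
```

With k1 = 0x00FF and k2 = 0x0F0F, the three models return 0x000F, 0x0F00 and 0x00F0 respectively.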


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kandnr (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KCONCATH - Pack and Move High Vector Mask


Opcode                      Instruction             Description

VEX.NDS.128.0F.W0 95 /r     kconcath r64, k1, k2    Concatenate vector masks k1 and k2 into the high part
                                                    of register r64.


Description 


Packs vector masks k1 and k2 and moves the result to the high 32 bits of destination reg- 
ister r64. The rest of the destination register is zeroed. 


Operation


TMP[15:0] = k2[15:0]
TMP[31:16] = k1[15:0]
r64[31:0] = 0
r64[63:32] = TMP


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__int64 _mm512_kconcathi_64(__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 
Protected and Compatibility Mode 
#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If destination is a memory operand.
KCONCATL - Pack and Move Low Vector Mask


Opcode                      Instruction             Description

VEX.NDS.128.0F.W0 97 /r     kconcatl r64, k1, k2    Concatenate vector masks k1 and k2 into the low part of
                                                    register r64.


Description 


Packs vector masks k1 and k2 and moves the result to the low 32 bits of destination reg- 
ister r64. The rest of the destination register is zeroed. 


Operation


TMP[15:0] = k2[15:0]
TMP[31:16] = k1[15:0]
r64[31:0] = TMP
r64[63:32] = 0
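

The KCONCATH and KCONCATL operations can be modeled on ordinary integers as shown below. This is a sketch with hypothetical function names, not the compiler intrinsics; it simply follows the pseudocode, placing k1 in the upper 16 bits and k2 in the lower 16 bits of the packed 32-bit value.

```c
#include <stdint.h>

/* Pack k1:k2 into a 32-bit value, k1 in bits 31:16 and k2 in bits 15:0. */
uint32_t kpack_model(uint16_t k1, uint16_t k2)
{
    return ((uint32_t)k1 << 16) | (uint32_t)k2;
}

/* KCONCATL: packed value goes to r64[31:0]; r64[63:32] is zeroed. */
uint64_t kconcatl_model(uint16_t k1, uint16_t k2)
{
    return (uint64_t)kpack_model(k1, k2);
}

/* KCONCATH: packed value goes to r64[63:32]; r64[31:0] is zeroed. */
uint64_t kconcath_model(uint16_t k1, uint16_t k2)
{
    return (uint64_t)kpack_model(k1, k2) << 32;
}
```

OR-ing the results of one KCONCATH and one KCONCATL model call assembles four 16-bit masks into a single 64-bit value.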


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__int64 _mm512_kconcatlo_64(__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 
Protected and Compatibility Mode 
#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If destination is a memory operand.
KEXTRACT - Extract Vector Mask From Register


Opcode                        Instruction               Description

VEX.128.66.0F3A.W0 3E /r ib   kextract k1, r64, imm8    Extract field from general purpose register r64
                                                        into vector mask k1 using imm8.


Description 


Extract the 16-bit field selected by imm8[1:0] from general purpose register r64 and write 
the result into destination mask register k1. 


Operation


index = imm8[1:0] * 16
k1[15:0] = r64[index+15:index]
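

A minimal integer model of the extraction, with a hypothetical function name, shows that only the two low bits of imm8 take part in the field selection:

```c
#include <stdint.h>

/* Return the 16-bit field of r64 selected by imm8[1:0]. */
uint16_t kextract_model(uint64_t r64, uint8_t imm8)
{
    unsigned index = (imm8 & 0x3u) * 16u;  /* bit offset: 0, 16, 32 or 48 */
    return (uint16_t)(r64 >> index);       /* truncation keeps bits index+15:index */
}
```

For r64 = 0x1111222233334444, selectors 0 through 3 yield 0x4444, 0x3333, 0x2222 and 0x1111.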


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kextract_64(__int64, const int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If source is a memory operand.
KMERGE2L1H - Swap and Merge High Element Portion and Low Portion of
Vector Masks


Opcode                  Instruction           Description

VEX.128.0F.W0 48 /r     kmerge2l1h k1, k2     Concatenate the low half of vector mask k2 and the high half of
                                              vector mask k1 and store the result in the vector mask k1.


Description 


Move high element from vector mask register k1 into low element of vector mask register 
k1, and insert low element of k2 into the high portion of vector mask register k1. 


Operation


tmp = k1[15:8]
k1[15:8] = k2[7:0]
k1[7:0] = tmp


Flags Affected


None.


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kmerge2l1h (__mmask16, __mmask16 k2);


Exceptions


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KMERGE2L1L - Move Low Element Portion into High Portion of Vector
Mask


Opcode                  Instruction           Description

VEX.128.0F.W0 49 /r     kmerge2l1l k1, k2     Move low half of vector mask k2 into the high half of vector mask
                                              k1.


Description


Insert low element from vector mask register k2 into high element of vector mask register 
k1. Low element of k1 remains unchanged. 


Operation


k1[15:8] = k2[7:0]
*k1[7:0] remains unchanged*
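

The two merge instructions differ only in what ends up in the low byte of k1. The following integer model, with hypothetical function names, follows the Operation pseudocode of KMERGE2L1H and KMERGE2L1L:

```c
#include <stdint.h>

/* KMERGE2L1H: new k1[15:8] = k2[7:0], new k1[7:0] = old k1[15:8]. */
uint16_t kmerge2l1h_model(uint16_t k1, uint16_t k2)
{
    return (uint16_t)(((k2 & 0x00FFu) << 8) | (k1 >> 8));
}

/* KMERGE2L1L: new k1[15:8] = k2[7:0], k1[7:0] is left unchanged. */
uint16_t kmerge2l1l_model(uint16_t k1, uint16_t k2)
{
    return (uint16_t)(((k2 & 0x00FFu) << 8) | (k1 & 0x00FFu));
}
```

With k1 = 0xAB12 and k2 = 0xCD34, the models return 0x34AB and 0x3412 respectively.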


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kmerge2l1l (__mmask16, __mmask16 k2);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KMOV - Move Vector Mask


Opcode                  Instruction       Description

VEX.128.0F.W0 90 /r     kmov k1, k2       Move vector mask k2 and store the result in k1.
VEX.128.0F.W0 93 /r     kmov r32, k2      Move vector mask k2 to general purpose register r32.
VEX.128.0F.W0 92 /r     kmov k1, r32      Move general purpose register r32 to vector mask k1.


Description 


Either the vector mask register k2 or the general purpose register r32 is read, and its 
contents written into destination general purpose register r32 or vector mask register k1; 
however, general purpose register to general purpose register copies are not supported. 
When the destination is a general purpose register, the 16 bit value that is copied is zero- 
extended to the maximum operand size in the current mode. 


Operation


if (DEST is a general purpose register) {
    DEST[63:16] = 0
    DEST[15:0] = k2[15:0]
} else if (DEST is vector mask and SRC is a general purpose register) {
    k1[15:0] = SRC[15:0]
} else {
    k1[15:0] = k2[15:0]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kmov (__mmask16);
__mmask16 _mm512_int2mask (int);
int _mm512_mask2int (__mmask16);


Exceptions


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 

64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
          If source/destination is a memory operand.


KNOT - Not Vector Mask 


Opcode                  Instruction      Description

VEX.128.0F.W0 44 /r     knot k1, k2      Perform a bitwise NOT on vector mask k2 and
                                         store the result in k1.


Description 


Performs the bitwise NOT of the vector mask k2, and writes the result into vector mask 
k1. 


Operation


for (n = 0; n < 16; n++) {
    k1[n] = ~ k2[n]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_knot(__mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode


#NM       If CR0.TS[bit 3]=1.
#UD       If preceded by any REX, F0, F2, F3, or 66 prefixes.
KOR - OR Vector Masks


Opcode                  Instruction      Description

VEX.128.0F.W0 45 /r     kor k1, k2       Perform a bitwise OR between vector masks k1
                                         and k2 and store the result in vector mask k1.


Description 


Performs a bitwise OR between the vector mask k2, and the vector mask k1, and writes 
the result into vector mask k1. 


Operation 


for (n = 0; n < 16; n++) {
    k1[n] = k1[n] | k2[n]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kor (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes


64 bit Mode


#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


KORTEST - OR Vector Mask And Set EFLAGS


Opcode Instruction Description
VEX.128.0F.W0 98 /r kortest k1, k2 Perform a bitwise OR between vector masks k1 and k2 and update ZF and CF
EFLAGS accordingly.


Description


Performs a bitwise OR between the vector mask register k2, and the vector mask register 
k1, and sets CF and ZF based on the operation result. 


The ZF flag is set if both sources are 0x0. The CF flag is set if, after the OR operation
is done, the operation result is all 1's.


Operation


CF = 1
ZF = 1


for (n = 0; n < 16; n++) {
    tmp = (k1[n] | k2[n])
    ZF &= (tmp == 0x0)
    CF &= (tmp == 0x1)
}
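

The flag computation above can be sketched in plain C. This is a minimal model, assuming the 16-bit mask registers are represented as uint16_t values; the type kortest_flags and the function kortest are names invented for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two flags that KORTEST updates. */
typedef struct {
    bool zf; /* set when the OR of both masks is all 0s */
    bool cf; /* set when the OR of both masks is all 1s */
} kortest_flags;

/* Model of kortest k1, k2: OR the masks, then derive ZF and CF
   from the whole 16-bit result instead of looping bit by bit. */
static kortest_flags kortest(uint16_t k1, uint16_t k2) {
    uint16_t tmp = (uint16_t)(k1 | k2);
    kortest_flags f;
    f.zf = (tmp == 0x0000u);
    f.cf = (tmp == 0xFFFFu);
    return f;
}
```

Comparing the full 16-bit value against 0x0000 and 0xFFFF is equivalent to AND-accumulating the per-bit tests in the loop.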


Flags Affected


• The ZF flag is set if the result of OR-ing both sources is all 0s.
• The CF flag is set if the result of OR-ing both sources is all 1s.
• The OF, SF, AF, and PF flags are set to 0.


Intel® C/C++ Compiler Intrinsic Equivalent


int _mm512_kortestz (__mmask16, __mmask16);
int _mm512_kortestc (__mmask16, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


KXNOR - XNOR Vector Masks


Opcode Instruction Description
VEX.128.0F.W0 46 /r kxnor k1, k2 Perform a bitwise XNOR between vector masks k1 and k2 and store the result in
vector mask k1.


Description 


Performs a bitwise XNOR between the vector mask k1 and the vector mask k2, and the 
result is written into vector mask k1. 


The primary purpose of this instruction is to provide a way to set a vector mask register
to 0xFFFF in a single clock; this is accomplished by selecting the source and destination to
be the same mask register. In this case the result will be 0xFFFF regardless of the original
contents of the register.
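

The all-ones idiom follows from x XNOR x = ~(x ^ x) = ~0. A small C sketch, assuming a mask register is modeled as a uint16_t (the function name kxnor is illustrative only):

```c
#include <stdint.h>

/* Model of kxnor k1, k2 on 16-bit masks. The cast truncates the
   promoted int result back to 16 bits. */
static uint16_t kxnor(uint16_t k1, uint16_t k2) {
    return (uint16_t)~(k1 ^ k2);
}
```

Calling kxnor(k, k) returns 0xFFFF for any value of k, which is why the same register is used as both source and destination when the goal is to set every mask bit.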


Operation 


for (n = 0; n < 16; n++) {
    k1[n] = ~(k1[n] ^ k2[n])
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kxnor (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


KXOR - XOR Vector Masks


Opcode Instruction Description
VEX.128.0F.W0 47 /r kxor k1, k2 Perform a bitwise XOR between vector masks k1 and k2 and store the result in
vector mask k1.


Description 


Performs a bitwise XOR between the vector mask k2, and the vector mask k1, and writes 
the result into vector mask k1. 


Operation 


for (n = 0; n < 16; n++) {
    k1[n] = k1[n] ^ k2[n]
}


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_kxor (__mmask16, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes


64 bit Mode


#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


6.2 Vector Instructions


VADDNPD - Add and Negate Float64 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W1 50 /r vaddnpd zmm1 {k1}, zmm2, Sf64(zmm3/mt) Add float64 vector zmm2 and float64 vector
Sf64(zmm3/mt), negate the sum, and store the
result in zmm1, under write-mask.


Description 


Performs an element-by-element addition between float64 vector zmm2 and the float64 
vector result of the swizzle/broadcast/conversion process on memory or float64 vector 
zmm3, then negates the result. The final result is written into float64 vector zmm1. 


Note that all the operations must be performed before rounding. 


 x    y    RN/RU/RZ              RD
+0   +0    (-0) + (-0) = -0      (-0) + (-0) = -0
+0   -0    (-0) + (+0) = +0      (-0) + (+0) = -0
-0   +0    (+0) + (-0) = +0      (+0) + (-0) = -0
-0   -0    (+0) + (+0) = +0      (+0) + (+0) = +0


Table 6.1: VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3
for other cases with a result of zero.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
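

The negate-and-add behavior under a write-mask can be sketched as a scalar loop in C. This is a minimal model, assuming the default round-to-nearest mode and no swizzle or up-conversion; the function name vaddnpd_model and the array-based operands are hypothetical stand-ins for the zmm registers.

```c
#include <stdint.h>

/* Hypothetical scalar model of vaddnpd: 8 float64 lanes, with bit n of
   k1 controlling whether lane n of dst is written. Each source is
   negated before the addition, so the result is rounded only once. */
static void vaddnpd_model(double dst[8], uint8_t k1,
                          const double src2[8], const double src3[8]) {
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = (-src2[n]) + (-src3[n]);
        }
        /* lanes with a clear mask bit keep their previous value */
    }
}
```

With both inputs equal to +0 the model produces -0 in round-to-nearest, matching the first row of the table of zero outcomes.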


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-zmm2[i+63:i]) + (-tmpSrc3[i+63:i])
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_addn_pd (__m512d, __m512d);
__m512d _mm512_mask_addn_pd (__m512d, __mmask8, __m512d, __m512d);


Memory Up-conversion: Sf64


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {8to8} or [rax]     64
001     broadcast 1 element (x8)     [rax] {1to8}              8
010     broadcast 4 elements (x2)    [rax] {4to8}              32
011     reserved                     N/A                       N/A
100     reserved                     N/A                       N/A
101     reserved                     N/A                       N/A
110     reserved                     N/A                       N/A
111     reserved                     N/A                       N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0  Function: 4 x 64 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override              Usage
000     Round To Nearest (even)             , {rn}
001     Round Down (-INF)                   , {rd}
010     Round Up (+INF)                     , {ru}
011     Round Toward Zero                   , {rz}
100     Round To Nearest (even) with SAE    , {rn-sae}
101     Round Down (-INF) with SAE          , {rd-sae}
110     Round Up (+INF) with SAE            , {ru-sae}
111     Round Toward Zero with SAE          , {rz-sae}
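

The MVEX.EH=0 swizzle patterns can be modeled as a per-group permutation. The C sketch below is an illustration only: it assumes that each pattern string is written from the highest element to the lowest (so the rightmost letter selects the source for the lowest element of each 4-element group, with a naming the lowest source element), and the function name swizzle_f64 is hypothetical.

```c
/* Hypothetical model of a 4 x 64-bit register swizzle over 8 float64
   elements (two groups of four). pattern is a 4-letter string such as
   "cdab", read from highest destination element to lowest.
   dst and src must not overlap. */
static void swizzle_f64(double dst[8], const double src[8],
                        const char *pattern) {
    for (int group = 0; group < 2; group++) {
        for (int e = 0; e < 4; e++) {
            /* letter that names the source for destination element e */
            char sel = pattern[3 - e];
            dst[4 * group + e] = src[4 * group + (sel - 'a')];
        }
    }
}
```

Under these assumptions, "dcba" is the identity, "cdab" exchanges neighboring elements, and "aaaa" replicates the lowest element of each group.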

Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VADDNPS - Add and Negate Float32 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W0 50 /r vaddnps zmm1 {k1}, zmm2, Sf32(zmm3/mt) Add float32 vector zmm2 and float32 vector
Sf32(zmm3/mt), negate the sum, and store the
result in zmm1, under write-mask.


Description 


Performs an element-by-element addition between float32 vector zmm2 and the float32 
vector result of the swizzle/broadcast/conversion process on memory or float32 vector 
zmm3, then negates the result. The final result is written into float32 vector zmm1. 


Note that all the operations must be performed before rounding. 


 x    y    RN/RU/RZ              RD
+0   +0    (-0) + (-0) = -0      (-0) + (-0) = -0
+0   -0    (-0) + (+0) = +0      (-0) + (+0) = -0
-0   +0    (+0) + (-0) = +0      (+0) + (-0) = -0
-0   -0    (+0) + (+0) = +0      (+0) + (+0) = +0


Table 6.2: VADDN outcome when adding zeros depending on rounding-mode. See Signed Zeros in Section 4.6.1.3 
for other cases with a result of zero. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-zmm2[i+31:i]) + (-tmpSrc3[i+31:i])
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_addn_ps (__m512, __m512);
__m512 _mm512_mask_addn_ps (__m512, __mmask16, __m512, __m512);


Memory Up-conversion: Sf32


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {16to16} or [rax]   64
001     broadcast 1 element (x16)    [rax] {1to16}             4
010     broadcast 4 elements (x4)    [rax] {4to16}             16
011     float16 to float32           [rax] {float16}           32
100     uint8 to float32             [rax] {uint8}             16
110     uint16 to float32            [rax] {uint16}            32
111     sint16 to float32            [rax] {sint16}            32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override              Usage
000     Round To Nearest (even)             , {rn}
001     Round Down (-INF)                   , {rd}
010     Round Up (+INF)                     , {ru}
011     Round Toward Zero                   , {rz}
100     Round To Nearest (even) with SAE    , {rn-sae}
101     Round Down (-INF) with SAE          , {rd-sae}
110     Round Up (+INF) with SAE            , {ru-sae}
111     Round Toward Zero with SAE          , {rz-sae}

Exceptions


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VADDPD - Add Float64 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F.W1 58 /r vaddpd zmm1 {k1}, zmm2, Sf64(zmm3/mt) Add float64 vector zmm2 and float64 vector
Sf64(zmm3/mt) and store the result in zmm1,
under write-mask.


Description 


Performs an element-by-element addition between float64 vector zmm2 and the float64 
vector result of the swizzle/broadcast/conversion process on memory or float64 vector 
zmm3. The result is written into float64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] + tmpSrc3[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_add_pd (__m512d, __m512d);
__m512d _mm512_mask_add_pd (__m512d, __mmask8, __m512d, __m512d);


Memory Up-conversion: Sf64


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {8to8} or [rax]     64
001     broadcast 1 element (x8)     [rax] {1to8}              8
010     broadcast 4 elements (x2)    [rax] {4to8}              32
011     reserved                     N/A                       N/A
100     reserved                     N/A                       N/A
101     reserved                     N/A                       N/A
110     reserved                     N/A                       N/A
111     reserved                     N/A                       N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0  Function: 4 x 64 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override              Usage
000     Round To Nearest (even)             , {rn}
001     Round Down (-INF)                   , {rd}
010     Round Up (+INF)                     , {ru}
011     Round Toward Zero                   , {rz}
100     Round To Nearest (even) with SAE    , {rn-sae}
101     Round Down (-INF) with SAE          , {rd-sae}
110     Round Up (+INF) with SAE            , {ru-sae}
111     Round Toward Zero with SAE          , {rz-sae}
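

The MVEX.EH=1 encoding packs two fields into the three SSS bits: SSS[1:0] selects the rounding mode and SSS[2] requests suppression of exception flags (SAE), as the Operation pseudocode indicates. A small decoding sketch in C follows; the enum and struct names are invented for this example.

```c
/* Rounding modes in the order listed for SSS[1:0]. */
typedef enum {
    RC_NEAREST = 0, /* {rn} */
    RC_DOWN = 1,    /* {rd} */
    RC_UP = 2,      /* {ru} */
    RC_ZERO = 3     /* {rz} */
} rounding_mode;

/* Hypothetical decoded form of the 3-bit SSS field when MVEX.EH=1. */
typedef struct {
    rounding_mode rc; /* from SSS[1:0] */
    int sae;          /* 1 when SSS[2] is set */
} rounding_override;

static rounding_override decode_sss(unsigned sss) {
    rounding_override r;
    r.rc = (rounding_mode)(sss & 3u);
    r.sae = (int)((sss >> 2) & 1u);
    return r;
}
```

For example, the encoding 101 decodes to Round Down with SAE, which corresponds to the {rd-sae} usage.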

Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VADDPS - Add Float32 Vectors 


Opcode Instruction Description 
MVEX.NDS.512.0F.W0 58 /r vaddps zmm1 {k1}, zmm2, Sf32(zmm3/mt) Add float32 vector zmm2 and float32 vector
Sf32(zmm3/mt) and store the result in zmm1,
under write-mask.


Description 


Performs an element-by-element addition between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector 
zmm3. The result is written into float32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_add_ps (__m512, __m512);
__m512 _mm512_mask_add_ps (__m512, __mmask16, __m512, __m512);


Memory Up-conversion: Sf32


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {16to16} or [rax]   64
001     broadcast 1 element (x16)    [rax] {1to16}             4
010     broadcast 4 elements (x4)    [rax] {4to16}             16
011     float16 to float32           [rax] {float16}           32
100     uint8 to float32             [rax] {uint8}             16
110     uint16 to float32            [rax] {uint16}            32
111     sint16 to float32            [rax] {sint16}            32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override              Usage
000     Round To Nearest (even)             , {rn}
001     Round Down (-INF)                   , {rd}
010     Round Up (+INF)                     , {ru}
011     Round Toward Zero                   , {rz}
100     Round To Nearest (even) with SAE    , {rn-sae}
101     Round Down (-INF) with SAE          , {rd-sae}
110     Round Up (+INF) with SAE            , {ru-sae}
111     Round Toward Zero with SAE          , {rz-sae}

Exceptions


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VADDSETSPS - Add Float32 Vectors and Set Mask to Sign


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W0 CC /r vaddsetsps zmm1 {k1}, zmm2, Sf32(zmm3/mt) Add float32 vector zmm2 and
float32 vector Sf32(zmm3/mt) and store the sum in
zmm1 and the sign from the sum in k1,
under write-mask.


Description 


Performs an element-by-element addition between float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or float32 vector
zmm3. The result is written into float32 vector zmm1.


In addition, the sign of the result for the n-th element is written into the n-th bit of vector
mask k1.


It is the sign bit of the final result that gets copied to the destination, as opposed to the
result of comparison with zero.


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k1. Elements in zmm1 
and k1 with the corresponding bit clear in k1 register retain their previous value. 
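

To make the distinction between "sign bit" and "comparison with zero" concrete, here is a scalar C sketch. It assumes IEEE 754 binary32 floats, the default rounding mode, and no swizzle or up-conversion; the function name vaddsetsps_model is hypothetical, and the updated mask is returned rather than written in place.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of vaddsetsps: for each lane selected by
   k1, store the sum and copy the sum's sign bit into the same bit of
   the mask. Unselected lanes and mask bits are left unchanged. */
static uint16_t vaddsetsps_model(float dst[16], uint16_t k1,
                                 const float src2[16],
                                 const float src3[16]) {
    uint16_t k_out = k1;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            uint32_t bits;
            dst[n] = src2[n] + src3[n];
            memcpy(&bits, &dst[n], sizeof bits); /* raw IEEE 754 bits */
            uint32_t sign = bits >> 31;          /* bit 31 is the sign */
            k_out = (uint16_t)((k_out & ~(1u << n)) | (sign << n));
        }
    }
    return k_out;
}
```

A sum of -0 sets the mask bit even though -0 compares equal to zero, which is the case the description singles out.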


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i]
        k1[n] = zmm1[i+31]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_addsets_ps (__m512, __m512, __mmask16*);
__m512 _mm512_mask_addsets_ps (__m512, __mmask16, __m512, __m512, __mmask16*);


Memory Up-conversion: Sf32

S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {16to16} or [rax]   64
001     broadcast 1 element (x16)    [rax] {1to16}             4
010     broadcast 4 elements (x4)    [rax] {4to16}             16
011     float16 to float32           [rax] {float16}           32
100     uint8 to float32             [rax] {uint8}             16
110     uint16 to float32            [rax] {uint16}            32
111     sint16 to float32            [rax] {sint16}            32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override              Usage
000     Round To Nearest (even)             , {rn}
001     Round Down (-INF)                   , {rd}
010     Round Up (+INF)                     , {ru}
011     Round Toward Zero                   , {rz}
100     Round To Nearest (even) with SAE    , {rn-sae}
101     Round Down (-INF) with SAE          , {rd-sae}
110     Round Up (+INF) with SAE            , {ru-sae}
111     Round Toward Zero with SAE          , {rz-sae}




Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If no write mask is provided or selected write-mask is k0.


VALIGND - Align Doubleword Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F3A.W0 03 /r ib valignd zmm1 {k1}, zmm2, zmm3/mt, offset Shift right and merge vectors zmm2 and
zmm3/mt with doubleword granularity using
offset as number of elements to shift, and store
the final result in zmm1, under write-mask.


Description


Concatenates and shifts right doubleword elements from vector zmm2 and memory/vector
zmm3. The result is written into vector zmm1.


No swizzle, broadcast, or conversion is performed by this instruction. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
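

The concatenate-and-shift behavior can be sketched with a 32-element temporary in C. This is an illustrative model, assuming the second source has already been read from a register or memory; valignd_model and its array parameters are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of valignd: src3 forms the low 16 doublewords and
   src2 the high 16 doublewords of a 32-element value, which is shifted
   right by (imm8 & 0xF) elements. Lanes of dst whose bit in k1 is
   clear keep their previous contents. */
static void valignd_model(uint32_t dst[16], uint16_t k1,
                          const uint32_t src2[16],
                          const uint32_t src3[16], unsigned imm8) {
    uint32_t tmp[32];
    unsigned shift = imm8 & 0xFu;
    for (int n = 0; n < 16; n++) {
        tmp[n] = src3[n];      /* bits 511:0 */
        tmp[16 + n] = src2[n]; /* bits 1023:512 */
    }
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = tmp[n + shift]; /* index never exceeds 30 */
        }
    }
}
```

With a shift of 3, the lowest 13 result elements come from the top of the second source and the highest 3 come from the bottom of the first source.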


Operation 


src[511:0] = zmm3/mt


// Concatenate sources
tmp[511:0] = src[511:0]
tmp[1023:512] = zmm2[511:0]


// Shift right doubleword elements
SHIFT = imm8[3:0]
tmp[1023:0] = tmp[1023:0] >> (32*SHIFT)


// Apply write-mask
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmp[i+31:i]
    }
}


Flags Affected


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_alignr_epi32 (__m512i, __m512i, const int);
__m512i _mm512_mask_alignr_epi32 (__m512i, __mmask16, __m512i, __m512i, const int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
This instruction does not support any SwizzUpConv
different from the default value (no broadcast,
no conversion). If the SwizzUpConv function is set to any value
other than "no action", then an Invalid Opcode fault is
raised. This includes register swizzles.




VBLENDMPD - Blend Float64 Vectors using the Instruction Mask 


Opcode Instruction Description 

MVEX.NDS.512.66.0F38.W1 65 /r vblendmpd zmm1 {k1}, zmm2, Sf64(zmm3/mt) Blend float64 vector zmm2 and float64 vector
Sf64(zmm3/mt) and store the result in zmm1,
under write-mask.


Description 


Performs an element-by-element blending between float64 vector zmm2 and the float64 
vector result of the swizzle/broadcast/conversion process on memory or float64 vector 
zmm3, using the instruction mask as selector. The result is written into float64 vector 
zmm1. 


The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector: every element of the destination is conditionally selected between first
source or second source using the value of the related mask bit (0 for first source, 1 for
second source).
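

The selector semantics differ from write-masking in that every destination element is written. A minimal C sketch, assuming no swizzle or up-conversion is applied to the second source (vblendmpd_model is a hypothetical name):

```c
#include <stdint.h>

/* Hypothetical model of vblendmpd: bit n of k1 picks the second source
   (src3) when set and the first source (src2) when clear. All eight
   destination lanes are overwritten. */
static void vblendmpd_model(double dst[8], uint8_t k1,
                            const double src2[8],
                            const double src3[8]) {
    for (int n = 0; n < 8; n++) {
        dst[n] = ((k1 >> n) & 1u) ? src3[n] : src2[n];
    }
}
```

Because unselected lanes take their value from the first source, the previous contents of the destination never survive the operation.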


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    i = 64*n
    if(k1[n]==1 or *no write-mask*) {
        zmm1[i+63:i] = tmpSrc3[i+63:i]
    } else {
        zmm1[i+63:i] = zmm2[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


None.


Denormal Handling


Treat Input Denormals As Zeros :
NO


Flush Tiny Results To Zero :
NO


Memory Up-conversion: Sf64


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {8to8} or [rax]     64
001     broadcast 1 element (x8)     [rax] {1to8}              8
010     broadcast 4 elements (x2)    [rax] {4to8}              32
011     reserved                     N/A                       N/A
100     reserved                     N/A                       N/A
101     reserved                     N/A                       N/A
110     reserved                     N/A                       N/A
111     reserved                     N/A                       N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0  Function: 4 x 64 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_mask_blend_pd (__mmask8, __m512d, __m512d);


Exceptions


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VBLENDMPS - Blend Float32 Vectors using the Instruction Mask


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W0 65 /r vblendmps zmm1 {k1}, zmm2, Sf32(zmm3/mt) Blend float32 vector zmm2 and float32 vector
Sf32(zmm3/mt) and store the result in zmm1,
under write-mask.


Description 


Performs an element-by-element blending between float32 vector zmm2 and the float32 
vector result of the swizzle/broadcast/conversion process on memory or float32 vector 
zmm3, using the instruction mask as selector. The result is written into float32 vector 
zmm1. 


The mask is not used as a write-mask for this instruction. Instead, the mask is used as an 
element selector: every element of the destination is conditionally selected between first 
source or second source using the value of the related mask bit (0 for first source, 1 for 
second source).


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    i = 32*n
    if(k1[n]==1 or *no write-mask*) {
        zmm1[i+31:i] = tmpSrc3[i+31:i]
    } else {
        zmm1[i+31:i] = zmm2[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Invalid.


Denormal Handling


Treat Input Denormals As Zeros :
NO


Flush Tiny Results To Zero :
NO


Memory Up-conversion: Sf32


S2S1S0  Function:                    Usage                     disp8*N
000     no conversion                [rax] {16to16} or [rax]   64
001     broadcast 1 element (x16)    [rax] {1to16}             4
010     broadcast 4 elements (x4)    [rax] {4to16}             16
011     float16 to float32           [rax] {float16}           32
100     uint8 to float32             [rax] {uint8}             16
110     uint16 to float32            [rax] {uint16}            32
111     sint16 to float32            [rax] {sint16}            32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits        Usage
000     no swizzle                   zmm0 or zmm0 {dcba}
001     swap (inner) pairs           zmm0 {cdab}
010     swap with two-away           zmm0 {badc}
011     cross-product swizzle        zmm0 {dacb}
100     broadcast a element          zmm0 {aaaa}
101     broadcast b element          zmm0 {bbbb}
110     broadcast c element          zmm0 {cccc}
111     broadcast d element          zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_mask_blend_ps (__mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes


64 bit Mode


#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
If preceded by any REX, F0, F2, F3, or 66 prefixes.


VBROADCASTF32X4 - Broadcast 4xFloat32 Vector


Opcode Instruction Description
MVEX.512.66.0F38.W0 1A /r vbroadcastf32x4 zmm1 {k1}, Uf32(mt) Broadcast 4xfloat32 vector Uf32(mt) into vector
zmm1, under write-mask.

Description 


The 4, 8 or 16 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to a float32 vector. The result is written into
float32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


// {4to16}
tmpSrc2[127:0] = UpConvLoad_f32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = i & 0x7F
        zmm1[i+31:i] = tmpSrc2[j+31:j]
    }
}
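

The index arithmetic j = i & 0x7F wraps the bit offset every 128 bits, so lane n reads source element n mod 4. A scalar C sketch of this {4to16} broadcast follows, assuming no up-conversion (the four source values are already float32); vbroadcastf32x4_model is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical model of vbroadcastf32x4: replicate four float32 values
   from memory across the 16 destination lanes, writing only the lanes
   whose bit in k1 is set. */
static void vbroadcastf32x4_model(float dst[16], uint16_t k1,
                                  const float mem[4]) {
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = mem[n & 3]; /* same as (32*n & 0x7F) / 32 */
        }
    }
}
```

With a full mask, the destination holds the four source values repeated four times in order.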


Flags Affected 


Invalid. 


Memory Up-conversion: Uf32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 16
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | float16 to float32 | [rax] {float16} | 8
100 | uint8 to float32 | [rax] {uint8} | 4
101 | sint8 to float32 | [rax] {sint8} | 4
110 | uint16 to float32 | [rax] {uint16} | 8
111 | sint16 to float32 | [rax] {sint16} | 8

Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_extload_ps (void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);
__m512 _mm512_mask_extload_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTF64X4 - Broadcast 4xFloat64 Vector


Opcode | Instruction | Description
MVEX.512.66.0F38.W1 1B /r | vbroadcastf64x4 zmm1 {k1}, Uf64(mt) | Broadcast 4xfloat64 vector Uf64(mt) into vector zmm1, under write-mask.

Description 


The 32 bytes at memory address mt are broadcast to a float64 vector. The result is written
into float64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


// {4to8}
tmpSrc2[255:0] = UpConvLoadf64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = i & 0xFF
        zmm1[i+63:i] = tmpSrc2[j+63:j]
    }
}


Flags Affected 


None. 


Memory Up-conversion: Uf64


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 32
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | reserved | N/A | N/A
100 | reserved | N/A | N/A
101 | reserved | N/A | N/A
110 | reserved | N/A | N/A
111 | reserved | N/A | N/A

Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_extload_pd (void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);
__m512d _mm512_mask_extload_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTI32X4 - Broadcast 4xInt32 Vector


Opcode | Instruction | Description
MVEX.512.66.0F38.W0 5A /r | vbroadcasti32x4 zmm1 {k1}, Ui32(mt) | Broadcast 4xint32 vector Ui32(mt) into vector zmm1, under write-mask.

Description 


The 4, 8 or 16 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to an int32 vector. The result is written into
int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


// {4to16}
tmpSrc2[127:0] = UpConvLoadi32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = i & 0x7F
        zmm1[i+31:i] = tmpSrc2[j+31:j]
    }
}


Flags Affected 


None. 


Memory Up-conversion: Ui32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 16
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | reserved | N/A | N/A
100 | uint8 to uint32 | [rax] {uint8} | 4
101 | sint8 to sint32 | [rax] {sint8} | 4
110 | uint16 to uint32 | [rax] {uint16} | 8
111 | sint16 to sint32 | [rax] {sint16} | 8


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_extload_epi32 (void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);
__m512i _mm512_mask_extload_epi32 (__m512i, __mmask16, void const*, _MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTI64X4 - Broadcast 4xInt64 Vector


Opcode | Instruction | Description
MVEX.512.66.0F38.W1 5B /r | vbroadcasti64x4 zmm1 {k1}, Ui64(mt) | Broadcast 4xint64 vector Ui64(mt) into vector zmm1, under write-mask.

Description 


The 32 bytes at memory address mt are broadcast to an int64 vector. The result is written
into int64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


// {4to8}
tmpSrc2[255:0] = UpConvLoadi64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = i & 0xFF
        zmm1[i+63:i] = tmpSrc2[j+63:j]
    }
}


Flags Affected 


None. 


Memory Up-conversion: Ui64


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 32
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | reserved | N/A | N/A
100 | reserved | N/A | N/A
101 | reserved | N/A | N/A
110 | reserved | N/A | N/A
111 | reserved | N/A | N/A

Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_extload_epi64 (void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);
__m512i _mm512_mask_extload_epi64 (__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTSD - Broadcast Float64 Vector


Opcode | Instruction | Description
MVEX.512.66.0F38.W1 19 /r | vbroadcastsd zmm1 {k1}, Uf64(mt) | Broadcast float64 vector Uf64(mt) into vector zmm1, under write-mask.

Description 


The 8 bytes at memory address mt are broadcast to a float64 vector. The result is written
into float64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


// {1to8}
tmpSrc2[63:0] = UpConvLoadf64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[63:0]
    }
}


Flags Affected 


None. 


Memory Up-conversion: Uf64


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 8
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | reserved | N/A | N/A
100 | reserved | N/A | N/A
101 | reserved | N/A | N/A
110 | reserved | N/A | N/A
111 | reserved | N/A | N/A

Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_extload_pd (void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);
__m512d _mm512_mask_extload_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, _MM_BROADCAST64_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VBROADCASTSS - Broadcast Float32 Vector


Opcode | Instruction | Description
MVEX.512.66.0F38.W0 18 /r | vbroadcastss zmm1 {k1}, Uf32(mt) | Broadcast float32 vector Uf32(mt) into vector zmm1, under write-mask.

Description 


The 1, 2, or 4 bytes (depending on the conversion and broadcast in effect) at memory
address mt are broadcast and/or converted to a float32 vector. The result is written into
float32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


// {1to16}
tmpSrc2[31:0] = UpConvLoadf32(mt)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[31:0]
    }
}
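
As an illustration of how the up-conversion combines with the {1to16} broadcast, the sketch below models the operation in scalar C. It is a minimal model under stated assumptions: the names upconv_t, upconv_load_f32 and broadcast_1to16 are hypothetical, only the integer up-conversions are covered (float16 is omitted), and no exception flags are modelled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for this sketch; they are not part of any Intel header. */
typedef enum {
    UPCONV_NONE,    /* 4-byte float32, no conversion */
    UPCONV_UINT8,   /* 1 byte  */
    UPCONV_SINT8,   /* 1 byte  */
    UPCONV_UINT16,  /* 2 bytes */
    UPCONV_SINT16   /* 2 bytes */
} upconv_t;

/* Load one element from mt and convert it to float32. */
static float upconv_load_f32(const void *mt, upconv_t conv)
{
    uint8_t u8; int8_t s8; uint16_t u16; int16_t s16; float f32;

    switch (conv) {
    case UPCONV_UINT8:  memcpy(&u8,  mt, sizeof u8);  return (float)u8;
    case UPCONV_SINT8:  memcpy(&s8,  mt, sizeof s8);  return (float)s8;
    case UPCONV_UINT16: memcpy(&u16, mt, sizeof u16); return (float)u16;
    case UPCONV_SINT16: memcpy(&s16, mt, sizeof s16); return (float)s16;
    default:            memcpy(&f32, mt, sizeof f32); return f32;
    }
}

/* {1to16}: replicate the converted scalar into every enabled element. */
void broadcast_1to16(float dst[16], const void *mt, upconv_t conv, uint16_t k1)
{
    float tmp = upconv_load_f32(mt, conv);
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = tmp;   /* masked-off elements keep their old value */
        }
    }
}
```

The number of bytes read in each case (1, 2 or 4) matches the disp8*N column of the up-conversion table for this instruction.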


Flags Affected 


Invalid. 


Memory Up-conversion: Uf32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] | 4
001 | reserved | N/A | N/A
010 | reserved | N/A | N/A
011 | float16 to float32 | [rax] {float16} | 2
100 | uint8 to float32 | [rax] {uint8} | 1
101 | sint8 to float32 | [rax] {sint8} | 1
110 | uint16 to float32 | [rax] {uint16} | 2
111 | sint16 to float32 | [rax] {sint16} | 2

Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_extload_ps (void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);
__m512 _mm512_mask_extload_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, _MM_BROADCAST32_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCMPPD - Compare Float64 Vectors and Set Vector Mask


Opcode | Instruction | Description
MVEX.NDS.512.66.0F.W1 C2 /r ib | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), imm8 | Compare between float64 vector zmm1 and float64 vector Sf64(zmm2/mt) and store the result in k2, under write-mask.


Description 


Performs an element-by-element comparison between float64 vector zmm1 and the
float64 vector result of the swizzle/broadcast/conversion from memory or float64 vector
zmm2. The result is written into vector mask k2.


Note: If DAZ=1, denormals are treated as zeros in the comparison (original source registers
untouched). +0 equals -0. Comparison with NaN returns false, except for the unordered
and negated predicates, which return true as listed in Table 6.3.


Infinities of like sign are considered equal. Infinity values of either sign are considered
ordered values.


Table 6.3 summarizes VCMPPD behavior, in particular showing how various NaN results 
can be produced. 


Predicate | Imm8 enc | Description | Emulation | If NaN | QNaN operand signals invalid
{eq} | 000 | A = B | | False | No
{lt} | 001 | A < B | | False | Yes
{le} | 010 | A <= B | | False | Yes
{gt} | | A > B | Swap operands, use LT | False | Yes
{ge} | | A >= B | Swap operands, use LE | False | Yes
{unord} | 011 | Unordered | | True | No
{neq} | 100 | NOT(A = B) | | True | No
{nlt} | 101 | NOT(A < B) | | True | Yes
{nle} | 110 | NOT(A <= B) | | True | Yes
{ngt} | | NOT(A > B) | Swap operands, use NLT | True | Yes
{nge} | | NOT(A >= B) | Swap operands, use NLE | True | Yes
{ord} | 111 | Ordered | | False | No


Table 6.3: VCMPPD behavior 


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless,
the operation is similar enough that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask
that keeps the masked elements inactive.


Immediate Format


Comparison Type | I2 | I1 | I0
eq     Equal | 0 | 0 | 0
lt     Less than | 0 | 0 | 1
le     Less than or Equal | 0 | 1 | 0
unord  Unordered | 0 | 1 | 1
neq    Not Equal | 1 | 0 | 0
nlt    Not Less than | 1 | 0 | 1
nle    Not Less than or Equal | 1 | 1 | 0
ord    Ordered | 1 | 1 | 1


Operation


switch (IMM8[2:0]) {
    case 0: OP = EQ; break;
    case 1: OP = LT; break;
    case 2: OP = LE; break;
    case 3: OP = UNORD; break;
    case 4: OP = NEQ; break;
    case 5: OP = NLT; break;
    case 6: OP = NLE; break;
    case 7: OP = ORD; break;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt)
}

for (n = 0; n < 8; n++) {
    k2[n] = 0
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        k2[n] = (zmm1[i+63:i] OP tmpSrc2[i+63:i]) ? 1 : 0
    }
}

k2[15:8] = 0

Instruction Pseudo-ops


Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:


Pseudo-Op | Implementation
vcmpeqpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {eq}
vcmpltpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {lt}
vcmplepd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {le}
vcmpunordpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {unord}
vcmpneqpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {neq}
vcmpnltpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {nlt}
vcmpnlepd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {nle}
vcmpordpd k2 {k1}, zmm1, Sf64(zmm2/mt) | vcmppd k2 {k1}, zmm1, Sf64(zmm2/mt), {ord}

SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf64


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] {8to8} or [rax] | 64
001 | broadcast 1 element (x8) | [rax] {1to8} | 8
010 | broadcast 4 elements (x2) | [rax] {4to8} | 32
011 | reserved | N/A | N/A
100 | reserved | N/A | N/A
101 | reserved | N/A | N/A
110 | reserved | N/A | N/A
111 | reserved | N/A | N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0 | Function: 4 x 64 bits | Usage
000 | no swizzle | zmm0 or zmm0 {dcba}
001 | swap (inner) pairs | zmm0 {cdab}
010 | swap with two-away | zmm0 {badc}
011 | cross-product swizzle | zmm0 {dacb}
100 | broadcast a element | zmm0 {aaaa}
101 | broadcast b element | zmm0 {bbbb}
110 | broadcast c element | zmm0 {cccc}
111 | broadcast d element | zmm0 {dddd}

MVEX.EH=1

S2S1S0 | Rounding Mode Override | Usage
1xx | SAE (Suppress-All-Exceptions) | , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask8 _mm512_cmpeq_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpeq_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmplt_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmplt_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmple_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmple_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpunord_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpunord_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpneq_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpneq_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpnlt_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpnlt_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpnle_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpnle_pd_mask (__mmask8, __m512d, __m512d);
__mmask8 _mm512_cmpord_pd_mask (__m512d, __m512d);
__mmask8 _mm512_mask_cmpord_pd_mask (__mmask8, __m512d, __m512d);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.




VCMPPS - Compare Float32 Vectors and Set Vector Mask


Opcode | Instruction | Description
MVEX.NDS.512.0F.W0 C2 /r ib | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), imm8 | Compare between float32 vector zmm1 and float32 vector Sf32(zmm2/mt) and store the result in k2, under write-mask.


Description 


Performs an element-by-element comparison between float32 vector zmm1 and the
float32 vector result of the swizzle/broadcast/conversion from memory or float32 vector
zmm2. The result is written into vector mask k2.


Note: If DAZ=1, denormals are treated as zeros in the comparison (original source registers
untouched). +0 equals -0. Comparison with NaN returns false, except for the unordered
and negated predicates, which return true as listed in Table 6.4.


Infinities of like sign are considered equal. Infinity values of either sign are considered
ordered values.


Table 6.4 summarizes VCMPPS behavior, in particular showing how various NaN results 
can be produced. 


Predicate | Imm8 enc | Description | Emulation | If NaN | QNaN operand signals invalid
{eq} | 000 | A = B | | False | No
{lt} | 001 | A < B | | False | Yes
{le} | 010 | A <= B | | False | Yes
{gt} | | A > B | Swap operands, use LT | False | Yes
{ge} | | A >= B | Swap operands, use LE | False | Yes
{unord} | 011 | Unordered | | True | No
{neq} | 100 | NOT(A = B) | | True | No
{nlt} | 101 | NOT(A < B) | | True | Yes
{nle} | 110 | NOT(A <= B) | | True | Yes
{ngt} | | NOT(A > B) | Swap operands, use NLT | True | Yes
{nge} | | NOT(A >= B) | Swap operands, use NLE | True | Yes
{ord} | 111 | Ordered | | False | No


Table 6.4: VCMPPS behavior


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless,
the operation is similar enough that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask
that keeps the masked elements inactive.

Immediate Format


Comparison Type | I2 | I1 | I0
eq     Equal | 0 | 0 | 0
lt     Less than | 0 | 0 | 1
le     Less than or Equal | 0 | 1 | 0
unord  Unordered | 0 | 1 | 1
neq    Not Equal | 1 | 0 | 0
nlt    Not Less than | 1 | 0 | 1
nle    Not Less than or Equal | 1 | 1 | 0
ord    Ordered | 1 | 1 | 1


Operation


switch (IMM8[2:0]) {
    case 0: OP = EQ; break;
    case 1: OP = LT; break;
    case 2: OP = LE; break;
    case 3: OP = UNORD; break;
    case 4: OP = NEQ; break;
    case 5: OP = NLT; break;
    case 6: OP = NLE; break;
    case 7: OP = ORD; break;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt)
}

for (n = 0; n < 16; n++) {
    k2[n] = 0
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
    }
}

Instruction Pseudo-ops


Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:


Pseudo-Op | Implementation
vcmpeqps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {eq}
vcmpltps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {lt}
vcmpleps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {le}
vcmpunordps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {unord}
vcmpneqps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {neq}
vcmpnltps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {nlt}
vcmpnleps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {nle}
vcmpordps k2 {k1}, zmm1, Sf32(zmm2/mt) | vcmpps k2 {k1}, zmm1, Sf32(zmm2/mt), {ord}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] {16to16} or [rax] | 64
001 | broadcast 1 element (x16) | [rax] {1to16} | 4
010 | broadcast 4 elements (x4) | [rax] {4to16} | 16
011 | float16 to float32 | [rax] {float16} | 32
100 | uint8 to float32 | [rax] {uint8} | 16
110 | uint16 to float32 | [rax] {uint16} | 32
111 | sint16 to float32 | [rax] {sint16} | 32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000 | no swizzle | zmm0 or zmm0 {dcba}
001 | swap (inner) pairs | zmm0 {cdab}
010 | swap with two-away | zmm0 {badc}
011 | cross-product swizzle | zmm0 {dacb}
100 | broadcast a element | zmm0 {aaaa}
101 | broadcast b element | zmm0 {bbbb}
110 | broadcast c element | zmm0 {cccc}
111 | broadcast d element | zmm0 {dddd}

MVEX.EH=1

S2S1S0 | Rounding Mode Override | Usage
1xx | SAE (Suppress-All-Exceptions) | , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmpeq_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpeq_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmplt_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmplt_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmple_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmple_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpunord_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpunord_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpneq_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpneq_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpnlt_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpnlt_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpnle_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpnle_ps_mask (__mmask16, __m512, __m512);
__mmask16 _mm512_cmpord_ps_mask (__m512, __m512);
__mmask16 _mm512_mask_cmpord_ps_mask (__mmask16, __m512, __m512);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTDQ2PD - Convert Int32 Vector to Float64 Vector


Opcode | Instruction | Description
MVEX.512.F3.0F.W0 E6 /r | vcvtdq2pd zmm1 {k1}, Si32(zmm2/mt) | Convert int32 vector Si32(zmm2/mt) to float64, and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element conversion from the int32 vector result of the
swizzle/broadcast/conversion from memory or int32 vector zmm2 to a float64 vector. The
result is written into float64 vector zmm1. The int32 source is read from either the lower
half of the source operand (int32 vector zmm2), the full memory source (8 elements, i.e.
256 bits) or the broadcast memory source.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[255:0] = zmm2[255:0]
} else {
    tmpSrc2[255:0] = SwizzUpConvLoadi32(zmm2/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        j = 32*n
        zmm1[i+63:i] = CvtInt32ToFloat64(tmpSrc2[j+31:j])
    }
}
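
As a rough scalar illustration of this operation, the C sketch below converts the lower eight int32 elements of a 16-element source into eight float64 results under a write-mask. The function name is hypothetical and the swizzle/broadcast step is assumed to have been applied to src beforehand.

```c
#include <stdint.h>

/* Hypothetical scalar model; src models the 512-bit int32 source, of which
 * only the lower 8 elements (256 bits) are read. dst models zmm1. */
void cvt_epi32lo_pd_model(double dst[8], const int32_t src[16], uint8_t k1)
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            /* every int32 value is exactly representable as a float64 */
            dst[n] = (double)src[n];
        }
        /* masked-off elements keep their previous value */
    }
}
```

Since the conversion is always exact, no rounding mode is involved, which is consistent with the instruction raising no SIMD floating-point exceptions.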


SIMD Floating-Point Exceptions 


None. 

Denormal Handling


Treat Input Denormals As Zeros : 
Not Applicable 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Si32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] {8to8} or [rax] | 32
001 | broadcast 1 element (x8) | [rax] {1to8} | 4
010 | broadcast 4 elements (x2) | [rax] {4to8} | 16
011 | reserved | N/A | N/A
100 | reserved | N/A | N/A
101 | reserved | N/A | N/A
110 | reserved | N/A | N/A
111 | reserved | N/A | N/A


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000 | no swizzle | zmm0 or zmm0 {dcba}
001 | swap (inner) pairs | zmm0 {cdab}
010 | swap with two-away | zmm0 {badc}
011 | cross-product swizzle | zmm0 {dacb}
100 | broadcast a element | zmm0 {aaaa}
101 | broadcast b element | zmm0 {bbbb}
110 | broadcast c element | zmm0 {cccc}
111 | broadcast d element | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_cvtepi32lo_pd (__m512i);
__m512d _mm512_mask_cvtepi32lo_pd (__m512d, __mmask8, __m512i);

Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to 4, 16 or 32-byte (depending on the swizzle broadcast).
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.
    This instruction does not support any SwizzUpConv involving data conversion.
    If SwizzUpConvMem function from memory is set to any value different than
    "no action", {1to8} or {4to8}, then an Invalid Opcode fault is raised. Note
    that this rule only applies to memory conversions (register swizzles are allowed).

VCVTFXPNTDQ2PS - Convert Fixed Point Int32 Vector to Float32 Vector


Opcode | Instruction | Description
MVEX.512.0F3A.W0 CB /r ib | vcvtfxpntdq2ps zmm1 {k1}, Si32(zmm2/mt), imm8 | Convert int32 vector Si32(zmm2/mt) to float32, and store the result in zmm1, using imm8, under write-mask.


Description 


Performs an element-by-element conversion from the int32 vector result of the
swizzle/broadcast/conversion from memory or int32 vector zmm2 to a float32 vector, then
performs an optional adjustment to the exponent.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Immediate Format


Exponent Adjustment | value | I7 | I6 | I5 | I4
0 | 2^0 (32.0 - no exponent adjustment) | 0 | 0 | 0 | 0
4 | 2^4 (28.4) | 0 | 0 | 0 | 1
5 | 2^5 (27.5) | 0 | 0 | 1 | 0
8 | 2^8 (24.8) | 0 | 0 | 1 | 1
16 | 2^16 (16.16) | 0 | 1 | 0 | 0
24 | 2^24 (8.24) | 0 | 1 | 0 | 1
31 | 2^31 (1.31) | 0 | 1 | 1 | 0
32 | 2^32 (0.32) | 0 | 1 | 1 | 1
reserved | *must UD* | 1 | x | x | x

Operation 


expadj = IMM8[6:4]
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] =
            CvtInt32ToFloat32(tmpSrc2[i+31:i], RoundingMode) / EXPADJ_TABLE[expadj]
    }
}
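
The fixed-point interpretation can be made concrete with a scalar C sketch: the int32 value is converted to float32 and then divided by the power of two selected by the exponent adjustment field. The names below are hypothetical, the table values are taken from the Immediate Format table, and the rounding-mode override and exception flags are not modelled (the int-to-float conversion simply uses whatever rounding the C implementation applies).

```c
#include <stdint.h>

/* Divisors indexed by expadj = imm8[6:4]: 2^0, 2^4, 2^5, 2^8, 2^16, 2^24, 2^31, 2^32. */
static const float expadj_table[8] = {
    1.0f, 16.0f, 32.0f, 256.0f,
    65536.0f, 16777216.0f, 2147483648.0f, 4294967296.0f
};

/* Hypothetical scalar model; dst models zmm1, src the swizzled int32 source. */
void cvtfxpnt_epi32_ps_model(float dst[16], const int32_t src[16],
                             uint16_t k1, unsigned expadj)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            /* dividing by a power of two only changes the exponent */
            dst[n] = (float)src[n] / expadj_table[expadj & 7u];
        }
    }
}
```

For example, with expadj = 3 (format 24.8) the integer 256 converts to 1.0, since the low 8 bits are treated as the fractional part.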


SIMD Floating-Point Exceptions 


Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
Not Applicable 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Si32


S2S1S0 | Function: | Usage | disp8*N
000 | no conversion | [rax] {16to16} or [rax] | 64
001 | broadcast 1 element (x16) | [rax] {1to16} | 4
010 | broadcast 4 elements (x4) | [rax] {4to16} | 16
011 | reserved | N/A | N/A
100 | uint8 to uint32 | [rax] {uint8} | 16
101 | sint8 to sint32 | [rax] {sint8} | 16
110 | uint16 to uint32 | [rax] {uint16} | 32
111 | sint16 to sint32 | [rax] {sint16} | 32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000 | no swizzle | zmm0 or zmm0 {dcba}
001 | swap (inner) pairs | zmm0 {cdab}
010 | swap with two-away | zmm0 {badc}
011 | cross-product swizzle | zmm0 {dacb}
100 | broadcast a element | zmm0 {aaaa}
101 | broadcast b element | zmm0 {bbbb}
110 | broadcast c element | zmm0 {cccc}
111 | broadcast d element | zmm0 {dddd}

MVEX.EH=1

S2S1S0 | Rounding Mode Override | Usage
1xx | SAE (Suppress-All-Exceptions) | , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_cvtfxpnt_round_adjustepi32_ps (__m512i, int, _MM_EXP_ADJ_ENUM);
__m512 _mm512_mask_cvtfxpnt_round_adjustepi32_ps (__m512, __mmask16, __m512i, int, _MM_EXP_ADJ_ENUM);


Exceptions


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0) If a memory address referencing the SS segment is in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
       If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
    If preceded by any REX, F0, F2, F3, or 66 prefixes.

VCVTFXPNTPD2DQ - Convert Float64 Vector to Fixed Point Int32 Vector


Opcode | Instruction | Description
MVEX.512.F2.0F3A.W1 E6 /r ib | vcvtfxpntpd2dq zmm1 {k1}, Sf64(zmm2/mt), imm8 | Convert float64 vector Sf64(zmm2/mt) to int32, and store the result in zmm1, using imm8, under write-mask.


Description 


Performs an element-by-element conversion and rounding from the float64 vector result 
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to an int32 
vector. The int32 result is written into the lower half of the destination register zmm1; 
the other half of the destination is set to zero. 


Out-of-range values are converted to the nearest representable value, and NaNs convert 
to 0. This makes the calculation of Exp2 more efficient (avoiding problems 
with converting very large values to integers, where undetected incorrect values could 
otherwise result from overflow). Table 6.5 describes the result for each 
floating-point special number. 


Table 6.5: Converting to integer special floating-point values behavior 


Input | Result 
NaN | 0 
+∞ | INT_MAX 
+0 | 0 
-0 | 0 
-∞ | INT_MIN 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Rounding Mode | I1 I0 
rn | Round to Nearest (even) | 0 0 
rd | Round Down (Round toward Negative Infinity) | 0 1 
ru | Round Up (Round toward Positive Infinity) | 1 0 
rz | Round toward Zero | 1 1 


Operation 


RoundingMode = IMM8[1:0] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        zmm1[j+31:j] = 
            CvtFloat64ToInt32(tmpSrc2[i+63:i], RoundingMode) 
    } 
} 


zmm1[511:256] = 0 
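
The Operation pseudo-code above, together with Table 6.5 and the rounding-mode encoding, can be modelled in scalar C. This is a minimal sketch under stated assumptions, not a description of the hardware: the function names `cvt_f64_to_i32` and `vcvtfxpntpd2dq_model` are hypothetical, the swizzle/up-conversion step is omitted, and exception flags and DAZ handling are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding-mode encodings from the Immediate Format table (IMM8[1:0]). */
enum { RN = 0, RD = 1, RU = 2, RZ = 3 };

/* Saturating float64 -> int32 conversion following Table 6.5. */
static int32_t cvt_f64_to_i32(double x, int mode)
{
    assert(mode >= RN && mode <= RZ);
    if (x != x) return 0;                       /* NaN -> 0 */
    if (x >= 2147483647.0) return INT32_MAX;    /* +inf and large values */
    if (x <= -2147483648.0) return INT32_MIN;   /* -inf and small values */

    int64_t t = (int64_t)x;                     /* C cast truncates toward zero */
    int64_t lo = ((double)t > x) ? t - 1 : t;   /* floor(x) */
    int64_t hi = ((double)t < x) ? t + 1 : t;   /* ceil(x) */
    if (mode == RD) return (int32_t)lo;
    if (mode == RU) return (int32_t)hi;
    if (mode == RZ) return (int32_t)t;

    double frac = x - (double)lo;               /* RN: ties go to the even value */
    if (frac > 0.5 || (frac == 0.5 && (lo & 1))) lo += 1;
    return (int32_t)lo;
}

/* Models the lane loop: 8 float64 inputs, results in the low 8 int32 lanes.
 * Lanes whose k1 bit is clear keep their previous value; the upper half
 * (lanes 8..15, i.e. zmm1[511:256]) is always set to zero. */
static void vcvtfxpntpd2dq_model(int32_t dst[16], uint8_t k1,
                                 const double src[8], int mode)
{
    for (int n = 0; n < 8; n++)
        if ((k1 >> n) & 1)
            dst[n] = cvt_f64_to_i32(src[n], mode);
    for (int n = 8; n < 16; n++)
        dst[n] = 0;
}
```

The rounding is done with integer casts and comparisons rather than the C floating-point environment so that the sketch does not depend on the current rounding mode of the host.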


SIMD Floating-Point Exceptions 


Invalid, Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf64 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {8to8} or [rax] 64 
001 broadcast 1 element (x8) [rax] {1to8} 8 
010 broadcast 4 elements (x2) [rax] {4to8} 32 
011 reserved N/A N/A 
100 reserved N/A N/A 
101 reserved N/A N/A 
110 reserved N/A N/A 
111 reserved N/A N/A 
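
The disp8*N column of the table above gives the factor by which an 8-bit displacement is scaled for each memory up-conversion mode. The following C sketch illustrates the arithmetic, assuming the displacement byte is a signed 8-bit value as in other x86 addressing forms. The names `SF64_DISP8_N` and `scaled_disp8` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* disp8*N scale for the Sf64 memory up-conversion table, indexed by S2S1S0.
 * A value of 0 marks a reserved encoding. */
static const int SF64_DISP8_N[8] = {64, 8, 32, 0, 0, 0, 0, 0};

/* Returns the byte displacement represented by an 8-bit displacement field. */
static int32_t scaled_disp8(int8_t disp8, unsigned sss)
{
    assert(sss < 8 && SF64_DISP8_N[sss] != 0);  /* reject reserved encodings */
    return (int32_t)disp8 * SF64_DISP8_N[sss];
}
```

For example, with no conversion (000) a displacement byte of 1 addresses the next 64-byte vector, while with {1to8} (001) it addresses the next 8-byte element.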
Register Swizzle: Sf64 


MVEX.EH=0 

S2S1S0 || Function: 4 x 64 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 

MVEX.EH=1 

S2S1S0 || Rounding Mode Override Usage 

1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_cvtfxpnt_roundpd_epi32lo(__m512d, int); 
__m512i _mm512_mask_cvtfxpnt_roundpd_epi32lo(__m512i, __mmask8, __m512d, int); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VCVTFXPNTPD2UDQ - Convert Float64 Vector to Fixed Point Uint32 Vector 


Opcode | Instruction | Description 
MVEX.512.F2.0F3A.W1 CA /r ib | vcvtfxpntpd2udq zmm1 {k1}, Sf64(zmm2/mt), imm8 | Convert float64 vector Sf64(zmm2/mt) to uint32, and store the result in zmm1, using imm8, under write-mask. 


Description 


Performs an element-by-element conversion and rounding from the float64 vector result 
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to a uint32 
vector. The uint32 result is written into the lower half of the destination register zmm1; 
the other half of the destination is set to zero. 


Out-of-range values are converted to the nearest representable value, and NaNs convert 
to 0. This makes the calculation of Exp2 more efficient (avoiding problems 
with converting very large values to integers, where undetected incorrect values could 
otherwise result from overflow). Table 6.6 describes the result for each 
floating-point special number. 


Input | Result 
NaN | 0 
+∞ | INT_MAX 
+0 | 0 
-0 | 0 
-∞ | INT_MIN 


Table 6.6: Converting to integer special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Rounding Mode | I1 I0 
rn | Round to Nearest (even) | 0 0 
rd | Round Down (Round toward Negative Infinity) | 0 1 
ru | Round Up (Round toward Positive Infinity) | 1 0 
rz | Round toward Zero | 1 1 


Operation 


RoundingMode = IMM8[1:0] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        zmm1[j+31:j] = 
            CvtFloat64ToUint32(tmpSrc2[i+63:i], RoundingMode) 
    } 
} 


zmm1[511:256] = 0 


SIMD Floating-Point Exceptions 


Invalid, Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf64 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {8to8} or [rax] 64 
001 broadcast 1 element (x8) [rax] {1to8} 8 
010 broadcast 4 elements (x2) [rax] {4to8} 32 
011 reserved N/A N/A 
100 reserved N/A N/A 
101 reserved N/A N/A 
110 reserved N/A N/A 
111 reserved N/A N/A 
Register Swizzle: Sf64 


MVEX.EH=0 

S2S1S0 || Function: 4 x 64 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 

MVEX.EH=1 

S2S1S0 || Rounding Mode Override Usage 

1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_cvtfxpnt_roundpd_epi32lo(__m512d, int); 
__m512i _mm512_mask_cvtfxpnt_roundpd_epi32lo(__m512i, __mmask8, __m512d, int); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VCVTFXPNTPS2DQ - Convert Float32 Vector to Fixed Point Int32 Vector 


Opcode | Instruction | Description 
MVEX.512.66.0F3A.W0 CB /r ib | vcvtfxpntps2dq zmm1 {k1}, Sf32(zmm2/mt), imm8 | Convert float32 vector Sf32(zmm2/mt) to int32, and store the result in zmm1, using imm8, under write-mask. 


Description 


Performs an element-by-element conversion and rounding from the float32 vector result 
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to an int32 
vector, with an optional exponent adjustment before the conversion. 


Out-of-range values are converted to the nearest representable value, and NaNs convert 
to 0. This makes the calculation of Exp2 more efficient (avoiding problems 
with converting very large values to integers, where undetected incorrect values could 
otherwise result from overflow). Table 6.7 describes the result for each 
floating-point special number. 


Input | Result 
NaN | 0 
+∞ | INT_MAX 
+0 | 0 
-0 | 0 
-∞ | INT_MIN 


Table 6.7: Converting to integer special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Rounding Mode | I1 I0 
rn | Round to Nearest (even) | 0 0 
rd | Round Down (Round toward Negative Infinity) | 0 1 
ru | Round Up (Round toward Positive Infinity) | 1 0 
rz | Round toward Zero | 1 1 


Exponent Adjustment | value | I7 I6 I5 I4 
0 | 2^0 (32.0 - no exponent adjustment) | 0 0 0 0 
4 | 2^4 (28.4) | 0 0 0 1 
5 | 2^5 (27.5) | 0 0 1 0 
8 | 2^8 (24.8) | 0 0 1 1 
16 | 2^16 (16.16) | 0 1 0 0 
24 | 2^24 (8.24) | 0 1 0 1 
31 | 2^31 (1.31) | 0 1 1 0 
32 | 2^32 (0.32) | 0 1 1 1 
reserved | *must UD* | 1 x x x 


Operation 


RoundingMode = IMM8[1:0] 
expadj = IMM8[6:4] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt) 
} 


for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = 
            CvtFloat32ToInt32(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode) 
    } 
} 
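
The exponent adjustment in the Operation above amounts to scaling the input by a power of two before the saturating conversion, which yields a fixed-point value. The following C sketch models a single element for the round-toward-zero case only. It is an illustration under stated assumptions: `EXPADJ_BITS` and `float_to_fixed_rz` are hypothetical names, the scaling is carried out in double precision for simplicity, and exception flags and DAZ are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Exponent adjustment selected by IMM8[6:4]: the scale is 2^EXPADJ_BITS[code],
 * following the Exponent Adjustment table (0, 4, 5, 8, 16, 24, 31, 32). */
static const int EXPADJ_BITS[8] = {0, 4, 5, 8, 16, 24, 31, 32};

/* float32 -> signed fixed point with saturation, rounding toward zero (rz). */
static int32_t float_to_fixed_rz(float x, unsigned expadj)
{
    assert(expadj < 8);
    double scaled = (double)x;
    for (int i = 0; i < EXPADJ_BITS[expadj]; i++)
        scaled *= 2.0;                               /* power-of-two scaling is exact */
    if (scaled != scaled) return 0;                  /* NaN -> 0 */
    if (scaled >= 2147483647.0) return INT32_MAX;    /* saturate high */
    if (scaled <= -2147483648.0) return INT32_MIN;   /* saturate low */
    return (int32_t)scaled;                          /* C cast truncates toward zero */
}
```

With `expadj` code 5 (8.24 format), `float_to_fixed_rz(1.5f, 5)` gives `0x01800000`: 1 in the upper 8 bits and one half in the 24 fraction bits.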


SIMD Floating-Point Exceptions 


Invalid, Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf32 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {16to16} or [rax] 64 
001 broadcast 1 element (x16) [rax] {1to16} 4 
010 broadcast 4 elements (x4) [rax] {4to16} 16 
011 float16 to float32 [rax] {float16} 32 
100 uint8 to float32 [rax] {uint8} 16 
110 uint16 to float32 [rax] {uint16} 32 
111 sint16 to float32 [rax] {sint16} 32 


Register Swizzle: Sf32 


MVEX.EH=0 
S2S1S0 || Function: 4 x 32 bits Usage 
000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 
S2S1S0 || Rounding Mode Override Usage 
1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_cvtfxpnt_round_adjustps_epi32(__m512, int, _MM_EXP_ADJ_ENUM); 
__m512i _mm512_mask_cvtfxpnt_round_adjustps_epi32(__m512i, __mmask16, __m512, 
int, _MM_EXP_ADJ_ENUM); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VCVTFXPNTPS2UDQ - Convert Float32 Vector to Fixed Point Uint32 Vector 


Opcode | Instruction | Description 
MVEX.512.66.0F3A.W0 CA /r ib | vcvtfxpntps2udq zmm1 {k1}, Sf32(zmm2/mt), imm8 | Convert float32 vector Sf32(zmm2/mt) to uint32, and store the result in zmm1, using imm8, under write-mask. 


Description 


Performs an element-by-element conversion and rounding from the float32 vector result 
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to a uint32 
vector, with an optional exponent adjustment before the conversion. 


Out-of-range values are converted to the nearest representable value, and NaNs convert 
to 0. This makes the calculation of Exp2 more efficient (avoiding problems 
with converting very large values to integers, where undetected incorrect values could 
otherwise result from overflow). Table 6.8 describes the result for each 
floating-point special number. 


Input | Result 
NaN | 0 
+∞ | INT_MAX 
+0 | 0 
-0 | 0 
-∞ | INT_MIN 


Table 6.8: Converting to integer special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Rounding Mode | I1 I0 
rn | Round to Nearest (even) | 0 0 
rd | Round Down (Round toward Negative Infinity) | 0 1 
ru | Round Up (Round toward Positive Infinity) | 1 0 
rz | Round toward Zero | 1 1 


Exponent Adjustment | value | I7 I6 I5 I4 
0 | 2^0 (32.0 - no exponent adjustment) | 0 0 0 0 
4 | 2^4 (28.4) | 0 0 0 1 
5 | 2^5 (27.5) | 0 0 1 0 
8 | 2^8 (24.8) | 0 0 1 1 
16 | 2^16 (16.16) | 0 1 0 0 
24 | 2^24 (8.24) | 0 1 0 1 
31 | 2^31 (1.31) | 0 1 1 0 
32 | 2^32 (0.32) | 0 1 1 1 
reserved | *must UD* | 1 x x x 


Operation 


RoundingMode = IMM8[1:0] 
expadj = IMM8[6:4] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoadf32(zmm2/mt) 
} 


for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = 
            CvtFloat32ToUint32(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode) 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf32 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {16to16} or [rax] 64 
001 broadcast 1 element (x16) [rax] {1to16} 4 
010 broadcast 4 elements (x4) [rax] {4to16} 16 
011 float16 to float32 [rax] {float16} 32 
100 uint8 to float32 [rax] {uint8} 16 
110 uint16 to float32 [rax] {uint16} 32 
111 sint16 to float32 [rax] {sint16} 32 


Register Swizzle: Sf32 


MVEX.EH=0 
S2S1S0 || Function: 4 x 32 bits Usage 
000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 
S2S1S0 || Rounding Mode Override Usage 
1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_cvtfxpnt_round_adjustps_epi32(__m512, int, _MM_EXP_ADJ_ENUM); 
__m512i _mm512_mask_cvtfxpnt_round_adjustps_epi32(__m512i, __mmask16, __m512, 
int, _MM_EXP_ADJ_ENUM); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VCVTFXPNTUDQ2PS - Convert Fixed Point Uint32 Vector to Float32 Vector 


Opcode | Instruction | Description 
MVEX.512.0F3A.W0 CA /r ib | vcvtfxpntudq2ps zmm1 {k1}, Si32(zmm2/mt), imm8 | Convert uint32 vector Si32(zmm2/mt) to float32, and store the result in zmm1, using imm8, under write-mask. 


Description 


Performs an element-by-element conversion from the uint32 vector result of the swiz- 
zle/broadcast/conversion from memory or uint32 vector zmm2 to a float32 vector, then 
performs an optional adjustment to the exponent. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Exponent Adjustment | value | I7 I6 I5 I4 
0 | 2^0 (32.0 - no exponent adjustment) | 0 0 0 0 
4 | 2^4 (28.4) | 0 0 0 1 
5 | 2^5 (27.5) | 0 0 1 0 
8 | 2^8 (24.8) | 0 0 1 1 
16 | 2^16 (16.16) | 0 1 0 0 
24 | 2^24 (8.24) | 0 1 0 1 
31 | 2^31 (1.31) | 0 1 1 0 
32 | 2^32 (0.32) | 0 1 1 1 
reserved | *must UD* | 1 x x x 


Operation 


expadj = IMM8[6:4] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14 
    RoundingMode = SSS[1:0] 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    RoundingMode = MXCSR.RC 
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt) 
} 


for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = 
            CvtUint32ToFloat32(tmpSrc2[i+31:i], RoundingMode) / EXPADJ_TABLE[expadj] 
    } 
} 
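
The per-element step of the Operation above converts the integer first and then divides by the selected power of two. A minimal C sketch of that step follows. It assumes the host's default round-to-nearest-even mode in place of the selectable RoundingMode, and the names `EXPADJ_BITS` and `fixed_to_float` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Exponent adjustment selected by IMM8[6:4]: the divisor is 2^EXPADJ_BITS[code],
 * following the Exponent Adjustment table (0, 4, 5, 8, 16, 24, 31, 32). */
static const int EXPADJ_BITS[8] = {0, 4, 5, 8, 16, 24, 31, 32};

/* uint32 fixed point -> float32: convert, then divide by the scale. */
static float fixed_to_float(uint32_t v, unsigned expadj)
{
    assert(expadj < 8);
    float r = (float)v;        /* the only step that may round (24-bit significand) */
    for (int i = 0; i < EXPADJ_BITS[expadj]; i++)
        r *= 0.5f;             /* halving is exact for these magnitudes */
    return r;
}
```

For instance, `fixed_to_float(0x01800000u, 5)` interprets the input as an 8.24 value and returns 1.5f.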


SIMD Floating-Point Exceptions 


Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
Not Applicable 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Si32 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {16to16} or [rax] 64 
001 broadcast 1 element (x16) [rax] {1to16} 4 
010 broadcast 4 elements (x4) [rax] {4to16} 16 
011 reserved N/A N/A 
100 uint8 to uint32 [rax] {uint8} 16 
101 sint8 to sint32 [rax] {sint8} 16 
110 uint16 to uint32 [rax] {uint16} 32 
111 sint16 to sint32 [rax] {sint16} 32 


Register Swizzle: Si32 


MVEX.EH=0 

S2S1S0 || Function: 4 x 32 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 

MVEX.EH=1 

S2S1S0 || Rounding Mode Override Usage 

1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VCVTPD2PS - Convert Float64 Vector to Float32 Vector 


Opcode | Instruction | Description 
MVEX.512.66.0F.W1 5A /r | vcvtpd2ps zmm1 {k1}, Sf64(zmm2/mt) | Convert float64 vector Sf64(zmm2/mt) to float32, and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element conversion and rounding from the float64 vector result 
of the swizzle/broadcast/conversion from memory or float64 vector zmm2 to a float32 
vector. The float32 result is written into the lower half of the destination register 
zmm1; the other half of the destination is set to zero. Table 6.9 describes the result 
for each floating-point special number. 


Input | Result 
NaN | Quietized NaN. Copy leading bits of float64 significand 
+∞ | +∞ 
+0 | +0 
-0 | -0 
-∞ | -∞ 


Table 6.9: Converting float64 to float32 special values behavior 
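
The NaN row of Table 6.9 can be illustrated at the bit level. The sketch below is one reading of "quietized NaN, copy leading bits of float64 significand": keep the sign, copy the top 23 of the 52 significand bits, and set the most significant significand bit (the quiet bit). This interpretation and the name `narrow_nan_bits` are assumptions for illustration, not a statement of the exact hardware behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Maps the bit pattern of a float64 NaN to a float32 quiet NaN bit pattern. */
static uint32_t narrow_nan_bits(uint64_t f64_bits)
{
    /* Precondition: input is a NaN (exponent all ones, non-zero significand). */
    assert(((f64_bits >> 52) & 0x7FF) == 0x7FF);
    assert((f64_bits & 0x000FFFFFFFFFFFFFULL) != 0);

    uint32_t sign = (uint32_t)(f64_bits >> 63) << 31;
    uint32_t frac = (uint32_t)((f64_bits >> 29) & 0x7FFFFF);  /* top 23 of 52 bits */
    return sign | 0x7F800000u | 0x00400000u | frac;           /* force the quiet bit */
}
```

Under this reading, a signaling NaN whose payload sits only in the low significand bits, such as `0x7FF0000000000001`, becomes the default quiet NaN pattern `0x7FC00000`.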


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14 
    RoundingMode = SSS[1:0] 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    RoundingMode = MXCSR.RC 
    tmpSrc2[511:0] = SwizzUpConvLoadf64(zmm2/mt) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        zmm1[j+31:j] = 
            CvtFloat64ToFloat32(tmpSrc2[i+63:i], RoundingMode) 
    } 
} 


zmm1[511:256] = 0 


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {8to8} or [rax] 64 
001 broadcast 1 element (x8) [rax] {1to8} 8 
010 broadcast 4 elements (x2) [rax] {4to8} 32 
011 reserved N/A N/A 
100 reserved N/A N/A 
101 reserved N/A N/A 
110 reserved N/A N/A 
111 reserved N/A N/A 


Register Swizzle: Sf64 


MVEX.EH=0 

S2S1S0 || Function: 4 x 64 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 

MVEX.EH=1 

S2S1S0 || Rounding Mode Override Usage 

000 Round To Nearest (even) , {rn} 
001 Round Down (-INF) , {rd} 
010 Round Up (+INF) , {ru} 
011 Round Toward Zero , {rz} 
100 Round To Nearest (even) with SAE , {rn-sae} 
101 Round Down (-INF) with SAE , {rd-sae} 
110 Round Up (+INF) with SAE , {ru-sae} 
111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 _mm512_cvtpd_pslo(__m512d); 
__m512 _mm512_mask_cvtpd_pslo(__m512d, __mmask8, __m512d); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to the data size granularity dictated by SwizzUpConv mode. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 




VCVTPS2PD - Convert Float32 Vector to Float64 Vector 


Opcode | Instruction | Description 
MVEX.512.0F.W0 5A /r | vcvtps2pd zmm1 {k1}, Sf32(zmm2/mt) | Convert float32 vector Sf32(zmm2/mt) to float64, and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element conversion and rounding from the float32 vector result 
of the swizzle/broadcast/conversion from memory or float32 vector zmm2 to a float64 
vector. The result is written into float64 vector zmm1. The float32 source is read from 
either the lower half of the source operand (float32 vector zmm2), the full memory source 
(8 elements, i.e. 256 bits), or the broadcast memory source. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[255:0] = zmm2[255:0] 
} else { 
    tmpSrc2[255:0] = SwizzUpConvLoadf32(zmm2/mt) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        zmm1[i+63:i] = 
            CvtFloat32ToFloat64(tmpSrc2[j+31:j]) 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: Sf32 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {8to8} or [rax] 32 
001 broadcast 1 element (x8) [rax] {1to8} 4 
010 broadcast 4 elements (x4) [rax] {4to8} 16 
011 reserved N/A N/A 
100 reserved N/A N/A 
101 reserved N/A N/A 
110 reserved N/A N/A 
111 reserved N/A N/A 


Register Swizzle: Sf32 


MVEX.EH=0 

S2S1S0 || Function: 4 x 32 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 

MVEX.EH=1 

S2S1S0 || Rounding Mode Override Usage 

1xx SAE (Suppress-All-Exceptions) , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512d _mm512_cvtpslo_pd(__m512); 
__m512d _mm512_mask_cvtpslo_pd(__m512d, __mmask8, __m512); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to 4, 16 or 32-byte (depending on the swizzle broadcast). 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 
This instruction does not support any SwizzUpConv involving data conversion. If SwizzUpConvMem function from memory is set to any value different than "no action", {1to8} or {4to8} then an Invalid Opcode fault is raised. Note that this rule only applies to memory conversions (register swizzles are allowed). 


VCVTUDQ2PD - Convert Uint32 Vector to Float64 Vector 


Opcode | Instruction | Description 
MVEX.512.F3.0F.W0 7A /r | vcvtudq2pd zmm1 {k1}, Si32(zmm2/mt) | Convert uint32 vector Si32(zmm2/mt) to float64, and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element conversion from the uint32 vector result of the 
swizzle/broadcast/conversion from memory or uint32 vector zmm2 to a float64 vector. The 
result is written into float64 vector zmm1. The uint32 source is read from either the lower 
half of the source operand (uint32 vector zmm2), the full memory source (8 elements, i.e. 
256 bits), or the broadcast memory source. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc2[255:0] = zmm2[255:0] 
} else { 
    tmpSrc2[255:0] = SwizzUpConvLoadi32(zmm2/mt) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        zmm1[i+63:i] = 
            CvtUint32ToFloat64(tmpSrc2[j+31:j]) 
    } 
} 


SIMD Floating-Point Exceptions 


None. 


Denormal Handling 


Treat Input Denormals As Zeros : 
Not Applicable 


Flush Tiny Results To Zero : 
Not Applicable 
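
That this conversion lists no SIMD floating-point exceptions and no denormal handling is consistent with the arithmetic: every uint32 value fits within the 53-bit significand of a float64, so the widening is exact. A minimal scalar model of the write-masked lane loop, with the hypothetical name `vcvtudq2pd_model` and without the swizzle/up-conversion step, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Widens the 8 low uint32 elements to 8 float64 elements under mask k1.
 * Lanes whose k1 bit is clear keep their previous value. */
static void vcvtudq2pd_model(double dst[8], uint8_t k1, const uint32_t src_lo[8])
{
    assert(dst != 0 && src_lo != 0);
    for (int n = 0; n < 8; n++)
        if ((k1 >> n) & 1)
            dst[n] = (double)src_lo[n];   /* exact: no rounding can occur */
}
```

Because the conversion is exact, converting a result back with `(uint32_t)dst[n]` recovers the original element.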


Memory Up-conversion: Si32 


S2S1S0 || Function: Usage disp8*N 
000 no conversion [rax] {8to8} or [rax] 32 
001 broadcast 1 element (x8) [rax] {1to8} 4 
010 broadcast 4 elements (x4) [rax] {4to8} 16 
011 reserved N/A N/A 
100 reserved N/A N/A 
101 reserved N/A N/A 
110 reserved N/A N/A 
111 reserved N/A N/A 


Register Swizzle: Si32 


MVEX.EH=0 

S2S1S0 || Function: 4 x 32 bits Usage 

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512d _mm512_cvtepu32lo_pd(__m512i); 
__m512d _mm512_mask_cvtepu32lo_pd(__m512d, __mmask8, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 

Protected and Compatibility Mode 
#UD Instruction not available in these modes 

64 bit Mode 
#SS(0) If a memory address referencing the SS segment is in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 
If a memory operand linear address is not aligned to 4, 16 or 32-byte (depending on the swizzle broadcast). 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1. 
If preceded by any REX, F0, F2, F3, or 66 prefixes. 
This instruction does not support any SwizzUpConv involving data conversion. If SwizzUpConvMem function from memory is set to any value different than "no action", {1to8} or {4to8} then an Invalid Opcode fault is raised. Note that this rule only applies to memory conversions (register swizzles are allowed). 


VEXP223PS - Base-2 Exponential Calculation of Float32 Vector 


Opcode | Instruction | Description 
MVEX.512.66.0F38.W0 C8 /r | vexp223ps zmm1 {k1}, zmm2/mt | Calculate the approx. exp2 from int32 vector zmm2/mt and store the result in zmm1, under write-mask. 


Description 


Computes the element-by-element base-2 exponential of the int32 vector in memory or 
int32 vector zmm2, with a relative error of 0.99 ULP. Input int32 values are 
treated as fixed-point numbers with a fraction offset of 24 bits (i.e. the 8 MSBs correspond 
to the sign and integer part; the 24 LSBs correspond to the fractional part). The result is 
written into float32 vector zmm1. 


exp2 of a floating-point input value is computed as a two-instruction sequence: 


1. vcvtfxpntps2dq (with exponent adjustment, so that the destination format is 32b, with 
8b for the integer part and 24b for the fractional part) 


2. vexp223ps 


All overflows are captured by the combination of the saturating behavior of the 
vcvtfxpntps2dq instruction and the detection of MAX_INT/MIN_INT by the vexp223ps 
instruction. Tiny input numbers are quietly flushed to the fixed-point value 0 by the 
vcvtfxpntps2dq instruction, which produces an overall output exp2(0) = 1.0f. 


The overall behavior of the two-instruction sequence is the following: 


• -∞ returns +0.0f 

• +0.0f returns 1.0f (exact result) 

• +∞ returns +∞ (#Overflow) 

• NaN returns 1.0f (#Invalid) 

• n, where n is an integral value, returns 2^n (exact result) 
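
The two-instruction sequence and the special cases listed above can be mimicked with a scalar reference model. The sketch below is an illustration only: `to_fixed_8_24` stands in for the conversion step (truncating, with saturation and NaN mapped to 0), and `exp2_fixed_8_24` stands in for the exponential step, using a Taylor series for the fractional part rather than whatever approximation the hardware uses. Both names are hypothetical, and exception flags are not modelled.

```c
#include <assert.h>
#include <float.h>
#include <math.h>     /* only for the INFINITY macro; no libm functions are called */
#include <stdint.h>

/* Step 1 model: float32 -> 8.24 fixed point, saturating, NaN -> 0. */
static int32_t to_fixed_8_24(float x)
{
    double s = (double)x * 16777216.0;            /* scale by 2^24 */
    if (s != s) return 0;
    if (s >= 2147483647.0) return INT32_MAX;
    if (s <= -2147483648.0) return INT32_MIN;
    return (int32_t)s;                            /* tiny inputs become 0 */
}

/* Step 2 model: exp2 of an 8.24 fixed-point value. */
static float exp2_fixed_8_24(int32_t v)
{
    if (v == INT32_MAX) return INFINITY;          /* saturated input: overflow */
    if (v == INT32_MIN) return 0.0f;              /* saturated input: +0.0f */

    int32_t q = v / 16777216;                     /* integer part (floor) */
    int32_t r = v % 16777216;                     /* fraction, made non-negative */
    if (r < 0) { q -= 1; r += 16777216; }

    /* 2^f = e^(f ln 2), summed as a Taylor series; exactly 1.0 when f == 0. */
    double y = ((double)r / 16777216.0) * 0.6931471805599453;
    double sum = 1.0, term = 1.0;
    for (int k = 1; k <= 20; k++) { term *= y / k; sum += term; }

    for (; q > 0; q--) sum *= 2.0;                /* scale by 2^q, exactly */
    for (; q < 0; q++) sum *= 0.5;
    if (sum < FLT_MIN) return 0.0f;               /* flush tiny results to zero */
    return (float)sum;
}
```

Chaining the two, `exp2_fixed_8_24(to_fixed_8_24(3.0f))` gives exactly 8.0f, and a NaN input gives 1.0f because it is first converted to the fixed-point value 0.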


Input      Result    Comments
MIN_INT    +0.0f
MAX_INT    +∞        Raise #O flag


Table 6.10: vexp2_1ulp() special int values behavior


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
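
The write-mask behavior can be sketched in scalar C as a conditional merge. This is an illustrative model only; the function name and the array representation of the 16-element vector are assumptions, not part of the instruction set definition.

```c
#include <stdint.h>

/* Merge a freshly computed 16-element result into dst under mask k1.
   Bit n of k1 controls element n. */
static void masked_merge16(float dst[16], const float result[16], uint16_t k1)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = result[n];   /* mask bit set: element is updated */
        }
        /* mask bit clear: dst[n] keeps its previous value */
    }
}
```

With k1 = 0x0005 only elements 0 and 2 of dst are overwritten; the other 14 elements are left as they were.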



Operation 


tmpSrc2[511:0] = zmm2/mt
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
}
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = exp2_1ulp(tmpSrc2[i+31:i])
    }
}

SIMD Floating-Point Exceptions 
Overflow. 
Denormal Handling 
Treat Input Denormals As Zeros : 
Not Applicable 
Flush Tiny Results To Zero : 
YES 
Register Swizzle

MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      reserved                  N/A
010      reserved                  N/A
011      reserved                  N/A
100      reserved                  N/A
101      reserved                  N/A
110      reserved                  N/A
111      reserved                  N/A

MVEX.EH=1
S2S1S0   Rounding Mode Override             Usage
1xx      SAE (Suppress-All-Exceptions)      , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_exp223_ps (__m512i);
__m512 _mm512_mask_exp223_ps (__m512, __mmask16, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   This instruction does not support any
                   SwizzUpConv different from the default value (no broadcast,
                   no conversion). If SwizzUpConv function is set to any value
                   different than "no action", then an Invalid Opcode fault is
                   raised. This includes register swizzles.


VFIXUPNANPD - Fix Up Special Float64 Vector Numbers With NaN Passthrough 


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W1 55 /r   vfixupnanpd zmm1 {k1}, zmm2, Si64(zmm3/mt)    Fix up, with NaN passthrough, special
                                                                              numbers in float64 vector zmm1, float64
                                                                              vector zmm2 and int64 vector
                                                                              Si64(zmm3/mt) and store the result in
                                                                              zmm1, under write-mask.


Description 


Performs an element-by-element fix-up of various real and special number types in 
the float64 vector zmm2 using the 21-bit table values from the result of the swiz- 
zle/broadcast/conversion process on memory or int64 vector zmm3. The result is 
merged into float64 vector zmm1. Unlike in vfixuppd, source NaN values are passed- 
through as quietized values. Note that, also unlike in vfixup, this quietization translates 
into a #IE exception flag being reported for input SNaNs. 


This instruction is specifically intended for use in fixing up the results of arithmetic
calculations involving one source, although it is generally useful for fixing up the results of
multiple-instruction sequences to reflect special-number inputs. For example, consider
rcp(0). Input 0 to rcp, and you should get inf. However, evaluating rcp via 2x - ax^2
(Newton-Raphson), where x = approx(1/0) = ∞, incorrectly yields NaN. To deal with
this, vfixupps can be used after the N-R reciprocal sequence to set the result to ∞ when
the input is 0.
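
The rcp(0) problem is easy to reproduce in scalar C. The sketch below is illustrative only: the function names are hypothetical, and the conditional in rcp_with_fixup is a simplified stand-in for the table-driven fix-up (it ignores the sign of zero).

```c
#include <math.h>   /* INFINITY macro */

/* One Newton-Raphson refinement step for 1/a from an estimate x:
   x' = 2x - a*x^2 */
static float rcp_nr_step(float a, float x)
{
    return 2.0f * x - a * x * x;
}

/* Refinement followed by a scalar stand-in for the fix-up step:
   when the input is zero, force the result to +infinity. */
static float rcp_with_fixup(float a, float x0)
{
    float r = rcp_nr_step(a, x0);
    return (a == 0.0f) ? INFINITY : r;
}
```

For a = 0 and x = ∞ the refinement step evaluates ∞ - (0 * ∞), and since 0 * ∞ is NaN the whole expression is NaN. The fix-up restores the expected ∞.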


Denormal inputs must be treated as zeros of the same sign if DAZ is enabled. 


Note that NO_CHANGE_TOKEN leaves the destination (output) unchanged. This means 
that if the destination is a denormal, its value is not flushed to 0. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


enum TOKEN_TYPE
{
    NO_CHANGE_TOKEN  = 0,
    NEG_INF_TOKEN    = 1,
    NEG_ZERO_TOKEN   = 2,
    POS_ZERO_TOKEN   = 3,
    POS_INF_TOKEN    = 4,
    NAN_TOKEN        = 5,
    MAX_DOUBLE_TOKEN = 6,
    MIN_DOUBLE_TOKEN = 7
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpzmm3[511:0] = zmm3[511:0]
} else {
    tmpzmm3[511:0] = SwizzUpConvLoad_i64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tsrc[63:0] = zmm2[i+63:i]

        if (IsNaN(tsrc[63:0]))
        {
            zmm1[i+63:i] = QNaN(zmm2[i+63:i])
        }
        else
        {
            // tmp is an int value
            if      (tsrc[63:0] == -inf) tmp = 0
            else if (tsrc[63:0] <  0)    tmp = 1
            else if (tsrc[63:0] == -0)   tmp = 2
            else if (tsrc[63:0] == +0)   tmp = 3
            else if (tsrc[63:0] == inf)  tmp = 5
            else /* tsrc[63:0] > 0 */    tmp = 4

            table[20:0] = tmpzmm3[i+63:i]
            token = table[(tmp*3)+2: tmp*3] // table is viewed as one 21-bit
                                            // little-endian value.
                                            // token is an int value
                                            // the 7th entry is unused

            // float64 result
            if      (token == NEG_INF_TOKEN)    zmm1[i+63:i] = -inf
            else if (token == NEG_ZERO_TOKEN)   zmm1[i+63:i] = -0
            else if (token == POS_ZERO_TOKEN)   zmm1[i+63:i] = +0
            else if (token == POS_INF_TOKEN)    zmm1[i+63:i] = +inf
            else if (token == NAN_TOKEN)        zmm1[i+63:i] = QNaN_indefinite
            else if (token == MAX_DOUBLE_TOKEN) zmm1[i+63:i] = NMAX
            else if (token == MIN_DOUBLE_TOKEN) zmm1[i+63:i] = -NMAX
            else if (token == NO_CHANGE_TOKEN)  { /* zmm1[i+63:i] remains unchanged */ }
        }
    }
}
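
The classification and table lookup at the heart of the pseudocode can be modeled for a single element in C. This is a sketch under stated assumptions: the helper names are hypothetical, NaN inputs are assumed to have been handled by the caller, and the sign of zero is read from the bit pattern.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>   /* INFINITY macro */

/* Sign bit of a float64, read from its bit pattern so that -0.0 is detected. */
static int sign_bit(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 63);
}

/* Class index for a non-NaN input; mirrors tmp in the pseudocode. */
static int classify(double x)
{
    if (x == -INFINITY) return 0;
    if (x < 0.0)        return 1;
    if (x == 0.0)       return sign_bit(x) ? 2 : 3;
    if (x == INFINITY)  return 5;
    return 4;
}

/* Extract the 3-bit token for a class from the 21-bit table. */
static unsigned lookup_token(uint32_t table, int cls)
{
    return (table >> (3 * cls)) & 0x7u;
}
```

A table value of 4 << 9, for instance, places token 4 in the entry for class 3 (+0) and leaves every other entry at token 0.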
SIMD Floating-Point Exceptions


Invalid.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
NO


Memory Up-conversion: Si64


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {8to8} or [rax]     64
001      broadcast 1 element (x8)      [rax] {1to8}              8
010      broadcast 4 elements (x2)     [rax] {4to8}              32
011      reserved                      N/A                       N/A
100      reserved                      N/A                       N/A
101      reserved                      N/A                       N/A
110      reserved                      N/A                       N/A
111      reserved                      N/A                       N/A
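
The three non-reserved memory modes can be sketched as scalar loads in C. The function names are hypothetical and the model ignores alignment and faults; it only shows which memory elements end up in which vector positions.

```c
#include <stdint.h>

/* {8to8}: no conversion, reads 8 elements (64 bytes). */
static void load_8to8(int64_t out[8], const int64_t *mem)
{
    for (int i = 0; i < 8; i++) out[i] = mem[i];
}

/* {1to8}: reads 1 element (8 bytes) and replicates it 8 times. */
static void load_1to8(int64_t out[8], const int64_t *mem)
{
    for (int i = 0; i < 8; i++) out[i] = mem[0];
}

/* {4to8}: reads 4 elements (32 bytes) and replicates the group twice. */
static void load_4to8(int64_t out[8], const int64_t *mem)
{
    for (int i = 0; i < 8; i++) out[i] = mem[i % 4];
}
```

The number of bytes each mode reads in this model (64, 8 and 32) coincides with the disp8*N column of the table.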


Register Swizzle: Si64


MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override             Usage
1xx      SAE (Suppress-All-Exceptions)      , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fixupnan_pd (__m512d, __m512d, __m512i);
__m512d _mm512_mask_fixupnan_pd (__m512d, __mmask8, __m512d, __m512i);
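
The swizzle patterns in the table can be read as per-group permutations. The C sketch below applies a 4-letter pattern to each group of four 64-bit elements. It rests on an assumption about the notation: letters a..d name elements 0..3 of a group, and the pattern is written with the highest element first, so that {dcba} is the identity. The function name is hypothetical.

```c
#include <stdint.h>

/* Apply a swizzle pattern such as "cdab" to both 4-element groups of an
   8-element vector. pattern[3] selects the source for element 0 of each
   group, pattern[0] the source for element 3. */
static void swizzle64(int64_t out[8], const int64_t in[8], const char *pattern)
{
    for (int group = 0; group < 2; group++) {
        for (int p = 0; p < 4; p++) {
            int src = pattern[3 - p] - 'a';
            out[4 * group + p] = in[4 * group + src];
        }
    }
}
```

Under this reading "cdab" exchanges elements 0/1 and 2/3 of each group, and "aaaa" copies element 0 of each group into all four positions.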


Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFIXUPNANPS - Fix Up Special Float32 Vector Numbers With NaN Passthrough


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W0 55 /r   vfixupnanps zmm1 {k1}, zmm2, Si32(zmm3/mt)    Fix up, with NaN passthrough, special
                                                                              numbers in float32 vector zmm1, float32
                                                                              vector zmm2 and int32 vector
                                                                              Si32(zmm3/mt) and store the result in
                                                                              zmm1, under write-mask.


Description 


Performs an element-by-element fix-up of various real and special number types in 
the float32 vector zmm2 using the 21-bit table values from the result of the swiz- 
zle/broadcast/conversion process on memory or int32 vector zmm3. The result is 
merged into float32 vector zmm1. Unlike in vfixupps, source NaN values are passed- 
through as quietized values. Note that, also unlike in vfixup, this quietization translates 
into a #IE exception flag being reported for input SNaNs. 


This instruction is specifically intended for use in fixing up the results of arithmetic
calculations involving one source, although it is generally useful for fixing up the results of
multiple-instruction sequences to reflect special-number inputs. For example, consider
rcp(0). Input 0 to rcp, and you should get inf. However, evaluating rcp via 2x - ax^2
(Newton-Raphson), where x = approx(1/0) = ∞, incorrectly yields NaN. To deal with
this, vfixupps can be used after the N-R reciprocal sequence to set the result to ∞ when
the input is 0.


Denormal inputs must be treated as zeros of the same sign if DAZ is enabled. 


Note that NO_CHANGE_TOKEN leaves the destination (output) unchanged. This means 
that if the destination is a denormal, its value is not flushed to 0. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


enum TOKEN_TYPE
{
    NO_CHANGE_TOKEN = 0,
    NEG_INF_TOKEN   = 1,
    NEG_ZERO_TOKEN  = 2,
    POS_ZERO_TOKEN  = 3,
    POS_INF_TOKEN   = 4,
    NAN_TOKEN       = 5,
    MAX_FLOAT_TOKEN = 6,
    MIN_FLOAT_TOKEN = 7
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpzmm3[511:0] = zmm3[511:0]
} else {
    tmpzmm3[511:0] = SwizzUpConvLoad_i32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tsrc[31:0] = zmm2[i+31:i]

        if (IsNaN(tsrc[31:0]))
        {
            zmm1[i+31:i] = QNaN(zmm2[i+31:i])
        }
        else
        {
            // tmp is an int value
            if      (tsrc[31:0] == -inf) tmp = 0
            else if (tsrc[31:0] <  0)    tmp = 1
            else if (tsrc[31:0] == -0)   tmp = 2
            else if (tsrc[31:0] == +0)   tmp = 3
            else if (tsrc[31:0] == inf)  tmp = 5
            else /* tsrc[31:0] > 0 */    tmp = 4

            table[20:0] = tmpzmm3[i+31:i]
            token = table[(tmp*3)+2: tmp*3] // table is viewed as one 21-bit
                                            // little-endian value.
                                            // token is an int value
                                            // the 7th entry is unused

            // float32 result
            if      (token == NEG_INF_TOKEN)   zmm1[i+31:i] = -inf
            else if (token == NEG_ZERO_TOKEN)  zmm1[i+31:i] = -0
            else if (token == POS_ZERO_TOKEN)  zmm1[i+31:i] = +0
            else if (token == POS_INF_TOKEN)   zmm1[i+31:i] = +inf
            else if (token == NAN_TOKEN)       zmm1[i+31:i] = QNaN_indefinite
            else if (token == MAX_FLOAT_TOKEN) zmm1[i+31:i] = NMAX
            else if (token == MIN_FLOAT_TOKEN) zmm1[i+31:i] = -NMAX
            else if (token == NO_CHANGE_TOKEN) { /* zmm1[i+31:i] remains unchanged */ }
        }
    }
}
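
Building a table value is the inverse of the lookup in the pseudocode: six 3-bit tokens are packed by class index, with bits 20:18 left unused. The C sketch below is a hypothetical helper, and the example table for the reciprocal case is one plausible choice rather than a value taken from this manual.

```c
#include <stdint.h>

/* Pack six 3-bit tokens into a 21-bit table. tokens[cls] is the token for
   class cls, using the same class order as tmp in the pseudocode:
   0: -inf, 1: negative, 2: -0, 3: +0, 4: positive, 5: +inf. */
static uint32_t make_fixup_table(const unsigned tokens[6])
{
    uint32_t table = 0;
    for (int cls = 0; cls < 6; cls++) {
        table |= (uint32_t)(tokens[cls] & 0x7u) << (3 * cls);
    }
    return table;
}
```

For a reciprocal fix-up one might map -0 to NEG_INF_TOKEN (1) and +0 to POS_INF_TOKEN (4) and leave every other class at NO_CHANGE_TOKEN (0), which packs to 0x840.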
SIMD Floating-Point Exceptions


Invalid.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
NO


Memory Up-conversion: Si32


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)     [rax] {1to16}             4
010      broadcast 4 elements (x4)     [rax] {4to16}             16
011      reserved                      N/A                       N/A
100      uint8 to uint32               [rax] {uint8}             16
101      sint8 to sint32               [rax] {sint8}             16
110      uint16 to uint32              [rax] {uint16}            32
111      sint16 to sint32              [rax] {sint16}            32
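
The difference between the unsigned and signed 8-bit up-conversions is zero extension versus sign extension. A minimal scalar sketch in C, with hypothetical function names, follows; each call reads 16 bytes, matching the disp8*N value listed for these modes.

```c
#include <stdint.h>

/* {uint8}: 16 bytes are zero-extended to 16 uint32 elements. */
static void upconv_uint8_to_uint32(uint32_t out[16], const uint8_t mem[16])
{
    for (int i = 0; i < 16; i++) out[i] = mem[i];
}

/* {sint8}: 16 bytes are sign-extended to 16 sint32 elements. */
static void upconv_sint8_to_sint32(int32_t out[16], const int8_t mem[16])
{
    for (int i = 0; i < 16; i++) out[i] = mem[i];
}
```

The byte 0xFF therefore becomes 255 under {uint8} and -1 under {sint8}.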


Register Swizzle: Si32


MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override             Usage
1xx      SAE (Suppress-All-Exceptions)      , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fixupnan_ps (__m512, __m512, __m512i);
__m512 _mm512_mask_fixupnan_ps (__m512, __mmask16, __m512, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD132PD - Multiply Destination By Second Source and Add To First
Source Float64 Vectors


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W1 98 /r   vfmadd132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm1 and float64
                                                                              vector Sf64(zmm3/mt), add the result to
                                                                              float64 vector zmm2, and store the final
                                                                              result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm1 and the 
float64 vector result of the swizzle/broadcast/conversion process on memory or vector 
float64 zmm3, then adds the result to float64 vector zmm2. The final sum is written into
float64 vector zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm1[i+63:i] * tmpSrc3[i+63:i] + zmm2[i+63:i]
    }
}
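
The element loop of the operation can be mirrored in scalar C. This is a rough model with a hypothetical function name; it uses ordinary C arithmetic, so unlike the instruction it may round the product before the addition, and it does not model rounding-mode overrides or exceptions.

```c
#include <stdint.h>

/* Scalar model of the 132 form: for each element selected by k1,
   zmm1 = zmm1 * src3 + zmm2. Unselected elements are left unchanged. */
static void vfmadd132pd_model(double zmm1[8], const double zmm2[8],
                              const double src3[8], uint8_t k1)
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            zmm1[n] = zmm1[n] * src3[n] + zmm2[n];
        }
    }
}
```

Here src3 stands for the 8-element result of the swizzle/broadcast/conversion step on zmm3 or memory.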
SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf64


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {8to8} or [rax]     64
001      broadcast 1 element (x8)      [rax] {1to8}              8
010      broadcast 4 elements (x2)     [rax] {4to8}              32
011      reserved                      N/A                       N/A
100      reserved                      N/A                       N/A
101      reserved                      N/A                       N/A
110      reserved                      N/A                       N/A
111      reserved                      N/A                       N/A
Register Swizzle: Sf64


MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override               Usage
000      Round To Nearest (even)              , {rn}
001      Round Down (-INF)                    , {rd}
010      Round Up (+INF)                      , {ru}
011      Round Toward Zero                    , {rz}
100      Round To Nearest (even) with SAE     , {rn-sae}
101      Round Down (-INF) with SAE           , {rd-sae}
110      Round Up (+INF) with SAE             , {ru-sae}
111      Round Toward Zero with SAE           , {rz-sae}
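
The rounding-override encoding splits the three SSS bits into an SAE flag and a 2-bit rounding mode, as the Operation pseudocode does with SSS[2] and SSS[1:0]. The C sketch below decodes that field; the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Decoded form of the SSS field when MVEX.EH=1 and the source is a register. */
struct sss_decode {
    bool     sae;  /* SSS[2]: suppress all exceptions */
    unsigned rc;   /* SSS[1:0]: rounding mode */
};

static struct sss_decode decode_sss(unsigned sss)
{
    struct sss_decode d;
    d.sae = ((sss >> 2) & 1u) != 0;
    d.rc  = sss & 3u;
    return d;
}

/* Assembler suffix for a 2-bit rounding mode, following the table rows. */
static const char *rc_suffix(unsigned rc)
{
    static const char *const names[4] = { "rn", "rd", "ru", "rz" };
    return names[rc & 3u];
}
```

For SSS = 101 this yields sae = true and rounding mode 01, which corresponds to the {rd-sae} row.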


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD132PS - Multiply Destination By Second Source and Add To First
Source Float32 Vectors


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W0 98 /r   vfmadd132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)    Multiply float32 vector zmm1 and float32
                                                                              vector Sf32(zmm3/mt), add the result to
                                                                              float32 vector zmm2, and store the final
                                                                              result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm1 and the 
float32 vector result of the swizzle/broadcast/conversion process on memory or vector 
float32 zmm3, then adds the result to float32 vector zmm2. The final sum is written into
float32 vector zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 
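
The practical effect of not rounding the intermediate product can be shown with float32 operands in C. The sketch below is an illustration, not the hardware data path: fused_like relies on the fact that the product of two float32 values is exact in double, so only the final sum is rounded (a rare double-rounding difference from a true fused operation is possible). Both function names are hypothetical.

```c
/* Unfused: the product is rounded to float32 before the addition.
   volatile forces that rounding and prevents the compiler from
   contracting the expression into a fused operation. */
static float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Fused-like: the exact product is kept (in double) until after the add. */
static float fused_like(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = 1 + 2^-13, b = 1 - 2^-13 and c = -1, the exact product is 1 - 2^-26. Rounded to float32 it becomes 1.0, so the unfused result is 0, while the fused-like result is -2^-26.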


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm1[i+31:i] * tmpSrc3[i+31:i] + zmm2[i+31:i]
    }
}
SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf32


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)     [rax] {1to16}             4
010      broadcast 4 elements (x4)     [rax] {4to16}             16
011      float16 to float32            [rax] {float16}           32
100      uint8 to float32              [rax] {uint8}             16
110      uint16 to float32             [rax] {uint16}            32
111      sint16 to float32             [rax] {sint16}            32
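
The disp8*N column can be read as a per-mode scale factor for the 8-bit displacement. The C sketch below assumes exactly that reading: the encoded signed disp8 is multiplied by the N listed for the selected mode. The function names are hypothetical, and the encoding that does not appear in the table (101) is given a scale of 0 here as a placeholder.

```c
#include <stdint.h>

/* N from the disp8*N column, indexed by S2S1S0.
   Index 5 (101) is not listed in the table; 0 is a placeholder. */
static int sf32_disp8_scale(unsigned s2s1s0)
{
    static const int n[8] = { 64, 4, 16, 32, 16, 0, 32, 32 };
    return n[s2s1s0 & 7u];
}

/* Byte displacement obtained by scaling the encoded 8-bit displacement. */
static int32_t sf32_displacement(int8_t disp8, unsigned s2s1s0)
{
    return (int32_t)disp8 * sf32_disp8_scale(s2s1s0);
}
```

Under this assumption an encoded displacement of 2 addresses byte offset 8 in {1to16} mode and byte offset 128 when no conversion is selected.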


Register Swizzle: Sf32


MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override               Usage
000      Round To Nearest (even)              , {rn}
001      Round Down (-INF)                    , {rd}
010      Round Up (+INF)                      , {ru}
011      Round Toward Zero                    , {rz}
100      Round To Nearest (even) with SAE     , {rn-sae}
101      Round Down (-INF) with SAE           , {rd-sae}
110      Round Up (+INF) with SAE             , {ru-sae}
111      Round Toward Zero with SAE           , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD213PD - Multiply First Source By Destination and Add Second
Source Float64 Vectors


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W1 A8 /r   vfmadd213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm2 and float64
                                                                              vector zmm1, add the result to float64
                                                                              vector Sf64(zmm3/mt), and store the final
                                                                              result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm2 and float64 
vector zmm1 and then adds the result to the float64 vector result of the swizzle/broadcast/conversion 
process on memory or vector float64 zmm3. The final sum is written into float64 vector 
zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * zmm1[i+63:i] + tmpSrc3[i+63:i]
    }
}
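
The digits in the mnemonic name which operands are multiplied and which is added. A per-element C sketch of the three orderings, with hypothetical function names and ordinary (not single-rounding) C arithmetic, makes the difference concrete. Each function returns the new value of the destination element.

```c
/* 132: destination times second source, plus first source. */
static double fmadd132(double zmm1, double zmm2, double src3)
{
    return zmm1 * src3 + zmm2;
}

/* 213: first source times destination, plus second source. */
static double fmadd213(double zmm1, double zmm2, double src3)
{
    return zmm2 * zmm1 + src3;
}

/* 231: first source times second source, plus destination. */
static double fmadd231(double zmm1, double zmm2, double src3)
{
    return zmm2 * src3 + zmm1;
}
```

With zmm1 = 2, zmm2 = 3 and src3 = 5 the three forms give 13, 11 and 17 respectively.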


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {8to8} or [rax]     64
001      broadcast 1 element (x8)      [rax] {1to8}              8
010      broadcast 4 elements (x2)     [rax] {4to8}              32
011      reserved                      N/A                       N/A
100      reserved                      N/A                       N/A
101      reserved                      N/A                       N/A
110      reserved                      N/A                       N/A
111      reserved                      N/A                       N/A
Register Swizzle: Sf64


MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override               Usage
000      Round To Nearest (even)              , {rn}
001      Round Down (-INF)                    , {rd}
010      Round Up (+INF)                      , {ru}
011      Round Toward Zero                    , {rz}
100      Round To Nearest (even) with SAE     , {rn-sae}
101      Round Down (-INF) with SAE           , {rd-sae}
110      Round Up (+INF) with SAE             , {ru-sae}
111      Round Toward Zero with SAE           , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions

Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD213PS - Multiply First Source By Destination and Add Second
Source Float32 Vectors


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W0 A8 /r   vfmadd213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)    Multiply float32 vector zmm2 and float32
                                                                              vector zmm1, add the result to float32
                                                                              vector Sf32(zmm3/mt), and store the final
                                                                              result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm2 and float32 
vector zmm1 and then adds the result to the float32 vector result of the swizzle/broadcast/conversion 
process on memory or vector float32 zmm3. The final sum is written into float32 vector 
zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * zmm1[i+31:i] + tmpSrc3[i+31:i]
    }
}
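
One pattern that maps naturally onto the 213 operand order is Horner evaluation of a polynomial, where the running value is multiplied by x and a new coefficient is added each step. The C sketch below is a scalar illustration with a hypothetical function name; it is a suggested use, not one described in this manual, and it uses ordinary C arithmetic rather than a single-rounding fused operation.

```c
/* Evaluate coef[0] + coef[1]*x + ... + coef[n-1]*x^(n-1) by Horner's rule.
   Requires n >= 1. Each step has the shape zmm1 = zmm2 * zmm1 + src3, with
   the accumulator as zmm1, x as zmm2 and the next coefficient as src3. */
static float horner(const float *coef, int n, float x)
{
    float acc = coef[n - 1];
    for (int i = n - 2; i >= 0; i--) {
        acc = x * acc + coef[i];
    }
    return acc;
}
```

For the coefficients {1, 2, 3} and x = 2 the loop computes 2*3 + 2 = 8 and then 2*8 + 1 = 17, which equals 1 + 2*2 + 3*4.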


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)     [rax] {1to16}             4
010      broadcast 4 elements (x4)     [rax] {4to16}             16
011      float16 to float32            [rax] {float16}           32
100      uint8 to float32              [rax] {uint8}             16
110      uint16 to float32             [rax] {uint16}            32
111      sint16 to float32             [rax] {sint16}            32
Register Swizzle: Sf32


MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override               Usage
000      Round To Nearest (even)              , {rn}
001      Round Down (-INF)                    , {rd}
010      Round Up (+INF)                      , {ru}
011      Round Toward Zero                    , {rz}
100      Round To Nearest (even) with SAE     , {rn-sae}
101      Round Down (-INF) with SAE           , {rd-sae}
110      Round Up (+INF) with SAE             , {ru-sae}
111      Round Toward Zero with SAE           , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);



Exceptions


Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

VFMADD231PD - Multiply First Source By Second Source and Add To Destination
Float64 Vectors


Opcode                          Instruction                                   Description
MVEX.NDS.512.66.0F38.W1 B8 /r   vfmadd231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm2 and float64
                                                                              vector Sf64(zmm3/mt), add the result to
                                                                              float64 vector zmm1, and store the final
                                                                              result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm2 and the 
float64 vector result of the swizzle/broadcast/conversion process on memory or vector 
float64 zmm3, then adds the result to float64 vector zmm1. The final sum is written into 
float64 vector zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i] + zmm1[i+63:i]
    }
}
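
Because the destination of the 231 form is also the addend, it has the shape of an accumulation step. A scalar dot product in C shows the pattern; the function name is hypothetical, and the code uses ordinary C arithmetic, so it does not reproduce the single rounding of the instruction.

```c
#include <stddef.h>

/* Dot product written so that each step has the 231 shape:
   acc (zmm1) = a[i] (zmm2) * b[i] (src3) + acc (zmm1). */
static double dot_accumulate(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc = a[i] * b[i] + acc;
    }
    return acc;
}
```

For a = {1, 2, 3} and b = {4, 5, 6} the accumulator takes the values 4, 14 and 32.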
SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :
(MXCSR.DAZ)? YES : NO


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf64


S2S1S0   Function:                     Usage                     disp8*N
000      no conversion                 [rax] {8to8} or [rax]     64
001      broadcast 1 element (x8)      [rax] {1to8}              8
010      broadcast 4 elements (x2)     [rax] {4to8}              32
011      reserved                      N/A                       N/A
100      reserved                      N/A                       N/A
101      reserved                      N/A                       N/A
110      reserved                      N/A                       N/A
111      reserved                      N/A                       N/A
Register Swizzle: Sf64


MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override               Usage
000      Round To Nearest (even)              , {rn}
001      Round Down (-INF)                    , {rd}
010      Round Up (+INF)                      , {ru}
011      Round Toward Zero                    , {rz}
100      Round To Nearest (even) with SAE     , {rn-sae}
101      Round Down (-INF) with SAE           , {rd-sae}
110      Round Up (+INF) with SAE             , {ru-sae}
111      Round Toward Zero with SAE           , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmadd_pd (__m512d, __m512d, __m512d, __mmask8);

Exceptions

Real-Address Mode and Virtual-8086
#UD                Instruction not available in these modes

Protected and Compatibility Mode
#UD                Instruction not available in these modes

64 bit Mode
#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.

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VFMADD231PS - Multiply First Source By Second Source and Add To Des- 


tination Float32 Vectors 


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W0 B8 /r   vfmadd231ps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)     Multiply float32 vector zmm2 and float32 vector
                                                                                 S_f32(zmm3/m_t), add the result to float32 vector
                                                                                 zmm1, and store the final result in zmm1, under
                                                                                 write-mask.

Description 


Performs an element-by-element multiplication between float32 vector zmm2 and the 
float32 vector result of the swizzle/broadcast/conversion process on memory or vector 
float32 zmm3, then adds the result to float32 vector zmm1. The final sum is written into 
float32 vector zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] + zmm1[i+31:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 
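The MVEX.EH=0 swizzle patterns above can be read as a small permutation table. The sketch below is one interpretation, not a definitive implementation: it assumes that letter `a` names element 0 of each 4-element group, that the pattern is written from the highest element to the lowest (so `{dcba}` is the identity, as the "no swizzle" row suggests), and that the same pattern is applied to each of the four groups independently. The function and table names are hypothetical.

```c
#include <assert.h>

/* Source index for each destination position inside a 4-element group,
 * indexed by the S2S1S0 value. Position 0 is the rightmost letter. */
static const int swizzle_src[8][4] = {
    {0, 1, 2, 3}, /* 000 {dcba} no swizzle             */
    {1, 0, 3, 2}, /* 001 {cdab} swap (inner) pairs     */
    {2, 3, 0, 1}, /* 010 {badc} swap with two-away     */
    {1, 2, 0, 3}, /* 011 {dacb} cross-product swizzle  */
    {0, 0, 0, 0}, /* 100 {aaaa} broadcast a element    */
    {1, 1, 1, 1}, /* 101 {bbbb} broadcast b element    */
    {2, 2, 2, 2}, /* 110 {cccc} broadcast c element    */
    {3, 3, 3, 3}  /* 111 {dddd} broadcast d element    */
};

/* Apply the swizzle to each of the four 4-element groups of a
 * sixteen-element float32 vector. out and in must not overlap. */
void swizzle_f32(float out[16], const float in[16], unsigned sss)
{
    assert(sss < 8);
    for (int group = 0; group < 4; group++)
        for (int pos = 0; pos < 4; pos++)
            out[4 * group + pos] = in[4 * group + swizzle_src[sss][pos]];
}
```

Under this reading, `{cdab}` turns the group (a, b, c, d) held in elements 0..3 into (b, a, d, c).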


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmadd_ps (__m512, __m512, __m512, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFMADD233PS - Multiply First Source By Specially Swizzled Second Source
and Add To Second Source Float32 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W0 A4 /r   vfmadd233ps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)     Multiply float32 vector zmm2 by certain elements
                                                                                 of float32 vector S_f32(zmm3/m_t), add the result
                                                                                 to certain elements of S_f32(zmm3/m_t), and store
                                                                                 the final result in zmm1, under write-mask.


Description 


This instruction is built around the concept of 4-element sets, of which there are four: 
elements 0-3, 4-7, 8-11, and 12-15. If we refer to the float32 vector result of the broadcast 
(no conversion is supported) process on memory or the float32 vector zmm3 (no swizzle 
is supported) as t3, then: 


Each element 0-3 of float32 vector zmm2 is multiplied by element 1 of t3, the result is 
added to element 0 of t3, and the final sum is written into the corresponding element 0-3 
of float32 vector zmm1. 


Each element 4-7 of float32 vector zmm2 is multiplied by element 5 of t3, the result is 
added to element 4 of t3, and the final sum is written into the corresponding element 4-7 
of float32 vector zmm1. 


Each element 8-11 of float32 vector zmm2 is multiplied by element 9 of t3, the result is 
added to element 8 of t3, and the final sum is written into the corresponding element 8-11 
of float32 vector zmm1. 


Each element 12-15 of float32 vector zmm2 is multiplied by element 13 of t3, the result 
is added to element 12 of t3, and the final sum is written into the corresponding element 
12-15 of float32 vector zmm1. 


Intermediate values are calculated to infinite precision, and are not truncated or rounded. 


This instruction makes it possible to perform scale and bias in a single instruction without 
needing to have either scale or bias already loaded in a register. This saves one vector load 
for each interpolant, representing around ten percent of shader instructions. 


For structure-of-arrays (SOA) operation, this instruction is intended to be used with the
{4to16} broadcast on src2, allowing all 16 scale and biases to be identical. For
array-of-structures (AOS) vec4 operations, no broadcast is used, allowing four different
scales and biases, one for each vec4.


No conversion or swizzling is supported for this instruction. However, all broadcasts
except {1to16} are supported (i.e. 16to16 and 4to16).


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        base = ( n & ~0x03 ) * 32
        scale[31:0] = tmpSrc3[base+63:base+32]
        bias[31:0] = tmpSrc3[base+31:base]
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * scale[31:0] + bias[31:0]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 




Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      reserved                     N/A                        N/A
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      reserved                     N/A                        N/A
100      reserved                     N/A                        N/A
101      reserved                     N/A                        N/A
110      reserved                     N/A                        N/A
111      reserved                     N/A                        N/A
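The table above leaves only two usable memory forms for this instruction, and the Exceptions section ties each one to an alignment requirement (64 bytes for the full load, 16 bytes for {4to16}). A minimal sketch of that check follows. The enum and function names are hypothetical, the order in which the two conditions are tested is an assumption, and the canonical-address and paging checks are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome codes for this sketch. */
typedef enum { MEM_OK, MEM_FAULT_UD, MEM_FAULT_GP } mem_check;

/* Check a vfmadd233ps memory operand: only S2S1S0 = 000 (no conversion,
 * 64-byte access) and 010 ({4to16}, 16-byte access) are accepted, and the
 * linear address must be aligned to the access size. */
mem_check check_fmadd233ps_mem(unsigned sss, uint64_t linear_address)
{
    unsigned size;
    assert(sss < 8);
    switch (sss) {
    case 0: size = 64; break;
    case 2: size = 16; break;
    default: return MEM_FAULT_UD; /* reserved encodings */
    }
    return (linear_address % size == 0) ? MEM_OK : MEM_FAULT_GP;
}
```

For instance, address 0x1010 is acceptable for the {4to16} form but not for the full 64-byte load.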


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits   Usage
000      no swizzle              zmm0 or zmm0 {dcba}
001      reserved                N/A
010      reserved                N/A
011      reserved                N/A
100      reserved                N/A
101      reserved                N/A
110      reserved                N/A
111      reserved                N/A
MVEX.EH=1

S2S1S0   Rounding Mode Override   Usage
000 Round To Nearest (even) , {rn} 
001 Round Down (-INF) , {rd} 
010 Round Up (+INF) , {ru} 
011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 
101 Round Down (-INF) with SAE , {rd-sae} 
110 Round Up (+INF) with SAE , {ru-sae} 
111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmadd233_ps (__m512, __m512);
__m512 _mm512_mask_fmadd233_ps (__m512, __mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to 16 or 64 bytes (depending on the swizzle broadcast).
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
This instruction does not support any
SwizzUpConv involving data conversion, register swizzling or
{1to16} broadcast. If the SwizzUpConv function is set to any
value other than "no action" or {4to16}, then
an Invalid Opcode fault is raised.


VFMSUB132PD - Multiply Destination By Second Source and Subtract
First Source Float64 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W1 9A /r   vfmsub132pd zmm1 {k1}, zmm2, S_f64(zmm3/m_t)     Multiply float64 vector zmm1 and float64 vector
                                                                                 S_f64(zmm3/m_t), subtract float64 vector zmm2
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description


Performs an element-by-element multiplication of float64 vector zmm1 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts float64 vector zmm2 from the result. The final result is written into
float64 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm1[i+63:i] * tmpSrc3[i+63:i] - zmm2[i+63:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: S_f64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFMSUB132PS - Multiply Destination By Second Source and Subtract First
Source Float32 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W0 9A /r   vfmsub132ps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)     Multiply float32 vector zmm1 and float32 vector
                                                                                 S_f32(zmm3/m_t), subtract float32 vector zmm2
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description


Performs an element-by-element multiplication of float32 vector zmm1 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts float32 vector zmm2 from the result. The final result is written into
float32 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm1[i+31:i] * tmpSrc3[i+31:i] - zmm2[i+31:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32
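The disp8*N column gives the factor by which an 8-bit displacement is scaled for each memory form. A small C sketch of that lookup follows. It assumes the usual x86 convention that disp8 is a signed byte; the table and function names are hypothetical, and encoding 101 is marked as unavailable because the table above has no row for it.

```c
#include <assert.h>
#include <stdint.h>

/* disp8*N factor for the S_f32 memory forms, indexed by S2S1S0.
 * 0 marks an encoding without a row in the table (101). */
static const int disp8_scale_f32[8] = { 64, 4, 16, 32, 16, 0, 32, 32 };

/* Byte displacement encoded by a signed 8-bit disp8 under the given
 * S2S1S0 value. *ok is cleared for an encoding without a table entry. */
int32_t effective_disp_f32(unsigned sss, int8_t disp8, int *ok)
{
    int n;
    assert(sss < 8 && ok != 0);
    n = disp8_scale_f32[sss];
    *ok = (n != 0);
    return (int32_t)disp8 * n;
}
```

A disp8 of 1 therefore addresses the next full 64-byte vector when no conversion is selected, but only the next 4-byte element under {1to16}.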


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes
64 bit Mode
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.




VFMSUB213PD - Multiply First Source By Destination and Subtract Second
Source Float64 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W1 AA /r   vfmsub213pd zmm1 {k1}, zmm2, S_f64(zmm3/m_t)     Multiply float64 vector zmm2 and float64 vector
                                                                                 zmm1, subtract float64 vector S_f64(zmm3/m_t)
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description 


Performs an element-by-element multiplication of float64 vector zmm2 and float64
vector zmm1, then subtracts the float64 vector result of the swizzle/broadcast/conversion
process on memory or vector float64 zmm3 from the result. The final result is written
into float64 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * zmm1[i+63:i] - tmpSrc3[i+63:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: S_f64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFMSUB213PS - Multiply First Source By Destination and Subtract Second
Source Float32 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W0 AA /r   vfmsub213ps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)     Multiply float32 vector zmm2 and float32 vector
                                                                                 zmm1, subtract float32 vector S_f32(zmm3/m_t)
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description 


Performs an element-by-element multiplication of float32 vector zmm2 and float32
vector zmm1, then subtracts the float32 vector result of the swizzle/broadcast/conversion
process on memory or vector float32 zmm3 from the result. The final result is written
into float32 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * zmm1[i+31:i] - tmpSrc3[i+31:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32
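To make the up-conversion rows above more concrete, the sketch below models three of them as plain C loads: `{1to16}` (4 bytes read), `{uint8}` (16 bytes read) and `{sint16}` (32 bytes read), matching the disp8*N column. The behaviour is inferred from the function names in the table, and the C function names are hypothetical. All three conversions are exact, since every 8-bit and 16-bit integer is representable in float32.

```c
#include <assert.h>
#include <stdint.h>

/* {1to16}: one float32 is loaded and copied to all sixteen elements. */
void load_1to16(float out[16], const float *mem)
{
    assert(out != 0 && mem != 0);
    for (int n = 0; n < 16; n++)
        out[n] = mem[0];
}

/* {uint8}: sixteen bytes are loaded and each is converted to float32. */
void load_uint8(float out[16], const uint8_t mem[16])
{
    for (int n = 0; n < 16; n++)
        out[n] = (float)mem[n];
}

/* {sint16}: sixteen signed 16-bit integers are converted to float32. */
void load_sint16(float out[16], const int16_t mem[16])
{
    for (int n = 0; n < 16; n++)
        out[n] = (float)mem[n];
}
```

The result of any of these loads plays the role of tmpSrc3 in the Operation pseudocode.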


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086


#UD Instruction not available in these modes


Protected and Compatibility Mode


#UD Instruction not available in these modes
64 bit Mode
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFMSUB231PD - Multiply First Source By Second Source and Subtract
Destination Float64 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W1 BA /r   vfmsub231pd zmm1 {k1}, zmm2, S_f64(zmm3/m_t)     Multiply float64 vector zmm2 and float64 vector
                                                                                 S_f64(zmm3/m_t), subtract float64 vector zmm1
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description


Performs an element-by-element multiplication of float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts float64 vector zmm1 from the result. The final result is written into
float64 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i] - zmm1[i+63:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: S_f64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is
in a non-canonical form.
#GP(0) If the memory address is in a non-canonical form.
If a memory operand linear address is not aligned
to the data size granularity dictated by SwizzUpConv
mode.
#PF(fault-code) For a page fault.
#NM If CR0.TS[bit 3]=1.
#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFMSUB231PS - Multiply First Source By Second Source and Subtract
Destination Float32 Vectors


Opcode                          Instruction                                      Description
MVEX.NDS.512.66.0F38.W0 BA /r   vfmsub231ps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)     Multiply float32 vector zmm2 and float32 vector
                                                                                 S_f32(zmm3/m_t), subtract float32 vector zmm1
                                                                                 from the result, and store the final result in
                                                                                 zmm1, under write-mask.


Description


Performs an element-by-element multiplication of float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts float32 vector zmm1 from the result. The final result is written into
float32 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations are performed before the final rounding.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] - zmm1[i+31:i]
    }
}


SIMD Floating-Point Exceptions


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits   Usage

000 no swizzle zmm0 or zmm0 {dcba} 
001 swap (inner) pairs zmm0 {cdab} 
010 swap with two-away zmm0 {badc} 
011 cross-product swizzle zmm0 {dacb} 
100 broadcast a element zmm0 {aaaa} 
101 broadcast b element zmm0 {bbbb} 
110 broadcast c element zmm0 {cccc} 
111 broadcast d element zmm0 {dddd} 
MVEX.EH=1 

S2S1S0   Rounding Mode Override   Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD 


Instruction not available in these modes 


Protected and Compatibility Mode 


#UD 


64 bit Mode 
#SS(0) 


#GP(0) 


#PF(fault-code) 
#NM 
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Instruction not available in these modes 


If a memory address referencing the SS segment is 
in a non-canonical form. 

If the memory address is in a non-canonical form. 

If a memory operand linear address is not aligned 

to the data size granularity dictated by SwizzUpConv 
mode. 

For a page fault. 

If CRO.TS[bit 3]=1. 

If preceded by any REX, FO, F2, F3, or 66 prefixes. 
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From First Source Float64 Vectors 
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Opcode 
MVEX.NDS.512.66.0F38.W1 9C /r 


Instruction 
vfnmadd132pd zmm1 {k1}, zmm2, Spg4(zmm3/m;) 


Description 
Multiply 
float64 vec- 
tor zmm1 and 
float64 vector 
Spea(zmm3/mz), 
negate, and add 
the result to 
float64 vector 
zmmz2, and 
store the fi- 
nal result in 
zmmi, under 
write-mask. 


Description 


Performs an element-by-element multiplication of float64 vector zmm1 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts the result from float64 vector zmm2. The final result is written into
float64 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm1[i+63:i] * tmpSrc3[i+63:i]) + zmm2[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64

MVEX.EH=0

S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 
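

As an illustration of the MVEX.EH=0 swizzle patterns listed above, the following C sketch
applies a register swizzle to eight float64 elements, treated as two lanes of four. It
assumes, following the {dcba} notation, that a is the least-significant element of each
lane and that patterns are written most-significant element first. The names SWIZZLE_SRC
and swizzle_f64 are hypothetical; this is a software model for reasoning about the
patterns, not part of any Intel API.

```c
#include <stddef.h>
#include <stdint.h>

/* Source element (0=a, 1=b, 2=c, 3=d) feeding each destination slot,
 * listed slot 0 first. Example: {cdab} means slot3=c, slot2=d, slot1=a,
 * slot0=b, so the row reads {b, a, d, c} = {1, 0, 3, 2}. */
static const uint8_t SWIZZLE_SRC[8][4] = {
    {0, 1, 2, 3}, /* 000 {dcba} no swizzle            */
    {1, 0, 3, 2}, /* 001 {cdab} swap (inner) pairs    */
    {2, 3, 0, 1}, /* 010 {badc} swap with two-away    */
    {1, 2, 0, 3}, /* 011 {dacb} cross-product swizzle */
    {0, 0, 0, 0}, /* 100 {aaaa} broadcast a element   */
    {1, 1, 1, 1}, /* 101 {bbbb} broadcast b element   */
    {2, 2, 2, 2}, /* 110 {cccc} broadcast c element   */
    {3, 3, 3, 3}, /* 111 {dddd} broadcast d element   */
};

/* Apply the swizzle selected by the 3-bit sss value to both 4-element lanes. */
static void swizzle_f64(const double src[8], double dst[8], unsigned sss)
{
    for (size_t lane = 0; lane < 2; lane++) {
        for (size_t slot = 0; slot < 4; slot++) {
            dst[4 * lane + slot] = src[4 * lane + SWIZZLE_SRC[sss & 7u][slot]];
        }
    }
}
```

The same index table would serve the float32 case by looping over four lanes instead of two.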


Intel® C/C++ Compiler Intrinsic Equivalent

__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


Opcode
MVEX.NDS.512.66.0F38.W0 9C /r

Instruction
vfnmadd132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector
Sf32(zmm3/mt), negate, and add the result to float32
vector zmm2, and store the final result in zmm1, under
write-mask.


Description 


Performs an element-by-element multiplication of float32 vector zmm1 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts the result from float32 vector zmm2. The final result is written into
float32 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm1[i+31:i] * tmpSrc3[i+31:i]) + zmm2[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 
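

The MVEX.EH=1 table above packs two independent controls into the three SSS bits: the low
two bits select the rounding mode and the high bit requests suppress-all-exceptions (SAE),
as the Operation pseudocode also indicates (RoundingMode = SSS[1:0], SAE when SSS[2]==1).
A minimal C sketch of that split follows; the type and function names are hypothetical
and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rounding modes in SSS[1:0] order, matching the rows of the table. */
typedef enum {
    ROUND_NEAREST_EVEN = 0, /* {rn} */
    ROUND_DOWN         = 1, /* {rd} */
    ROUND_UP           = 2, /* {ru} */
    ROUND_TOWARD_ZERO  = 3  /* {rz} */
} rounding_mode_t;

typedef struct {
    rounding_mode_t rounding;     /* taken from SSS[1:0] */
    bool suppress_all_exceptions; /* taken from SSS[2], the "-sae" suffix */
} rounding_override_t;

/* Split a 3-bit SSS field; only meaningful for a register operand with MVEX.EH=1. */
static rounding_override_t decode_rounding_override(uint8_t sss)
{
    rounding_override_t out;
    out.rounding = (rounding_mode_t)(sss & 0x3u);
    out.suppress_all_exceptions = ((sss >> 2) & 0x1u) != 0;
    return out;
}
```

For example, SSS=101 decodes to Round Down with SAE, which corresponds to the {rd-sae} row.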


Intel® C/C++ Compiler Intrinsic Equivalent

__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


Opcode
MVEX.NDS.512.66.0F38.W1 AC /r

Instruction
vfnmadd213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector zmm1,
negate, and add the result to float64 vector
Sf64(zmm3/mt), and store the final result in zmm1, under
write-mask.


Description 


Performs an element-by-element multiplication of float64 vector zmm2 and float64 vector
zmm1, then subtracts the result from the float64 vector result of the
swizzle/broadcast/conversion process on memory or vector float64 zmm3. The final result
is written into float64 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm2[i+63:i] * zmm1[i+63:i]) + tmpSrc3[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 


(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 


(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64

MVEX.EH=0

S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 
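

The disp8*N column of the float64 memory up-conversion table gives the scale factor N
applied to an 8-bit displacement. The C sketch below illustrates this, on the assumption
that the byte offset is simply the signed disp8 value multiplied by N for the selected
conversion. The table values are copied from this section; the function name and the use
of 0 to mark reserved encodings are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* disp8*N factors for Sf64 memory operands, indexed by S2S1S0.
 * 0 marks an encoding the table lists as reserved. */
static const int DISP8_SCALE_F64[8] = { 64, 8, 32, 0, 0, 0, 0, 0 };

/* Compute the byte offset for a signed 8-bit displacement.
 * Returns false (and leaves *byte_offset untouched) for reserved encodings. */
static bool scaled_disp8_f64(int8_t disp8, unsigned sss, int32_t *byte_offset)
{
    int scale = DISP8_SCALE_F64[sss & 7u];
    if (scale == 0) {
        return false;
    }
    *byte_offset = (int32_t)disp8 * scale;
    return true;
}
```

With {1to8} (SSS=001) a disp8 of 2 would address byte offset 16, whereas with no conversion
(SSS=000) the same disp8 would address byte offset 128.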


Intel® C/C++ Compiler Intrinsic Equivalent

__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


Opcode
MVEX.NDS.512.66.0F38.W0 AC /r

Instruction
vfnmadd213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector zmm1,
negate, and add the result to float32 vector
Sf32(zmm3/mt), and store the final result in zmm1, under
write-mask.


Description 


Performs an element-by-element multiplication of float32 vector zmm2 and float32 vector
zmm1, then subtracts the result from the float32 vector result of the
swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final result
is written into float32 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm2[i+31:i] * zmm1[i+31:i]) + tmpSrc3[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent

__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


Opcode
MVEX.NDS.512.66.0F38.W1 BC /r

Instruction
vfnmadd231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm2 and float64 vector
Sf64(zmm3/mt), negate, and add the result to float64
vector zmm1, and store the final result in zmm1, under
write-mask.


Description 


Performs an element-by-element multiplication of float64 vector zmm2 and the float64
vector result of the swizzle/broadcast/conversion process on memory or vector float64
zmm3, then subtracts the result from float64 vector zmm1. The final result is written into
float64 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -(zmm2[i+63:i] * tmpSrc3[i+63:i]) + zmm1[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64

MVEX.EH=0

S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent

__m512d _mm512_fnmadd_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmadd_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmadd_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


Opcode
MVEX.NDS.512.66.0F38.W0 BC /r

Instruction
vfnmadd231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm2 and float32 vector
Sf32(zmm3/mt), negate, and add the result to float32
vector zmm1, and store the final result in zmm1, under
write-mask.


Description 


Performs an element-by-element multiplication of float32 vector zmm2 and the float32
vector result of the swizzle/broadcast/conversion process on memory or vector float32
zmm3, then subtracts the result from float32 vector zmm1. The final result is written into
float32 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -(zmm2[i+31:i] * tmpSrc3[i+31:i]) + zmm1[i+31:i]
    }
}
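

The 132, 213 and 231 forms of VFNMADD differ only in which operand plays which role in
-(a * b) + c. The C sketch below models the float32 case lane by lane, including the
write-mask behaviour. It is a scalar approximation, not the hardware algorithm: the
function and type names are hypothetical, the swizzle/up-conversion step is assumed to
have already produced src3, and the arithmetic is done in double, where the float*float
product is exact but the final sum is rounded twice (to double, then to float) rather
than once as in the fused instruction.

```c
#include <stdint.h>

#define LANES 16

typedef enum { FORM_132, FORM_213, FORM_231 } fnmadd_form_t;

/* dst models zmm1, src2 models zmm2, src3 models tmpSrc3.
 * Lanes whose k1 bit is clear keep their previous dst value. */
static void fnmadd_ps_model(fnmadd_form_t form, float dst[LANES], uint16_t k1,
                            const float src2[LANES], const float src3[LANES])
{
    for (int n = 0; n < LANES; n++) {
        if (((k1 >> n) & 1u) == 0) {
            continue; /* masked-off lane: leave dst[n] unchanged */
        }
        double a, b, c; /* result = -(a * b) + c */
        switch (form) {
        case FORM_132: a = dst[n];  b = src3[n]; c = src2[n]; break;
        case FORM_213: a = src2[n]; b = dst[n];  c = src3[n]; break;
        default:       a = src2[n]; b = src3[n]; c = dst[n];  break; /* FORM_231 */
        }
        dst[n] = (float)(-(a * b) + c);
    }
}
```

With zmm1 = 2, zmm2 = 3 and tmpSrc3 = 5 in a lane, the three forms give -(2*5)+3 = -7,
-(3*2)+5 = -1 and -(3*5)+2 = -13 respectively.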


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent

__m512 _mm512_fnmadd_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmadd_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmadd_ps (__m512, __m512, __m512, __mmask16);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFNMSUB132PD - Multiply Destination By Second Source, Negate, and
Subtract First Source Float64 Vectors


Opcode
MVEX.NDS.512.66.0F38.W1 9E /r

Instruction
vfnmsub132pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)

Description
Multiply float64 vector zmm1 and float64 vector
Sf64(zmm3/mt), negate, and subtract float64 vector zmm2
from the result, and store the final result in zmm1,
under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm1 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, negates, and subtracts float64 vector zmm2. The final result is written into
float64 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.11: VFNMSUB outcome when adding zeros depending on rounding-mode

x*y   z    RN/RU/RZ             RD
+0    +0   (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0   (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0   (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0   (+0) + (+0) = +0     (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/m)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm1[i+63:i] * tmpSrc3[i+63:i]) - zmm2[i+63:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64

MVEX.EH=0

S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent

__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions

Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFNMSUB132PS - Multiply Destination By Second Source, Negate, and
Subtract First Source Float32 Vectors


Opcode
MVEX.NDS.512.66.0F38.W0 9E /r

Instruction
vfnmsub132ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)

Description
Multiply float32 vector zmm1 and float32 vector
Sf32(zmm3/mt), negate, and subtract float32 vector zmm2
from the result, and store the final result in zmm1,
under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm1 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, negates, and subtracts float32 vector zmm2. The final result is written into
float32 vector zmm1.

Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.12: VFNMSUB outcome when adding zeros depending on rounding-mode

x*y   z    RN/RU/RZ             RD
+0    +0   (-0) + (-0) = -0     (-0) + (-0) = -0
+0    -0   (-0) + (+0) = +0     (-0) + (+0) = -0
-0    +0   (+0) + (-0) = +0     (+0) + (-0) = -0
-0    -0   (+0) + (+0) = +0     (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm1[i+31:i] * tmpSrc3[i+31:i]) - zmm2[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override    Usage

000 Round To Nearest (even) , {rn} 

001 Round Down (-INF) , {rd} 

010 Round Up (+INF) , {ru} 

011 Round Toward Zero , {rz} 

100 Round To Nearest (even) with SAE , {rn-sae} 

101 Round Down (-INF) with SAE , {rd-sae} 

110 Round Up (+INF) with SAE , {ru-sae} 

111 Round Toward Zero with SAE , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.




VFNMSUB213PD - Multiply First Source By Destination, Negate, and Subtract Second Source Float64 Vectors


Opcode                           Instruction                                    Description
MVEX.NDS.512.66.0F38.W1 AE /r    vfnmsub213pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm2 and float64 vector zmm1, negate, and subtract float64 vector Sf64(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm2 and float64
vector zmm1, negates the product, and subtracts the float64 vector result of the
swizzle/broadcast/conversion process on memory or vector float64 zmm3. The final result
is written into float64 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.13: VFNMSUB outcome when adding zeros depending on rounding-mode


x*y    z     RN/RU/RZ               RD
+0     +0    (-0) + (-0) = -0       (-0) + (-0) = -0
+0     -0    (-0) + (+0) = +0       (-0) + (+0) = -0
-0     +0    (+0) + (-0) = +0       (+0) + (-0) = -0
-0     -0    (+0) + (+0) = +0       (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm2[i+63:i] * zmm1[i+63:i]) - tmpSrc3[i+63:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override              Usage
000      Round To Nearest (even)             , {rn}
001      Round Down (-INF)                   , {rd}
010      Round Up (+INF)                     , {ru}
011      Round Toward Zero                   , {rz}
100      Round To Nearest (even) with SAE    , {rn-sae}
101      Round Down (-INF) with SAE          , {rd-sae}
110      Round Up (+INF) with SAE            , {ru-sae}
111      Round Toward Zero with SAE          , {rz-sae}
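
For the MVEX.EH=1 register form, the Operation section splits the three SSS bits into a rounding mode (SSS[1:0]) and a suppress-all-exceptions flag (SSS[2]). A small decoder makes the mapping to the assembler suffixes explicit. This is an illustrative sketch; the function name rounding_override is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical decoder for the MVEX.EH=1 register form:
   SSS[1:0] selects the rounding mode, SSS[2] requests SAE. */
static const char *rounding_override(unsigned sss, int *sae) {
    static const char *const names[4] = { "rn", "rd", "ru", "rz" };
    assert(sss < 8 && sae);
    *sae = (int)((sss >> 2) & 1u);   /* SSS[2]: suppress all exceptions */
    return names[sss & 3u];          /* SSS[1:0]: rounding mode */
}
```

For example, encoding 101 yields the mode name "rd" with the SAE flag set, which corresponds to the {rd-sae} row.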


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFNMSUB213PS - Multiply First Source By Destination, Negate, and Subtract Second Source Float32 Vectors


Opcode                           Instruction                                    Description
MVEX.NDS.512.66.0F38.W0 AE /r    vfnmsub213ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)    Multiply float32 vector zmm2 and float32 vector zmm1, negate, and subtract float32 vector Sf32(zmm3/mt) from the result, and store the final result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm2 and float32
vector zmm1, negates the product, and subtracts the float32 vector result of the
swizzle/broadcast/conversion process on memory or vector float32 zmm3. The final result
is written into float32 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.14: VFNMSUB outcome when adding zeros depending on rounding-mode


x*y    z     RN/RU/RZ               RD
+0     +0    (-0) + (-0) = -0       (-0) + (-0) = -0
+0     -0    (-0) + (+0) = +0       (-0) + (+0) = -0
-0     +0    (+0) + (-0) = +0       (+0) + (-0) = -0
-0     -0    (+0) + (+0) = +0       (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm2[i+31:i] * zmm1[i+31:i]) - tmpSrc3[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32
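
The two broadcast encodings in the table above differ in how much memory they read: {1to16} reads one float32 (4 bytes) and {4to16} reads four (16 bytes), which matches the disp8*N column. A minimal scalar sketch follows; the function names broadcast_1to16 and broadcast_4to16 are hypothetical.

```c
#include <assert.h>

/* {1to16}: replicate one float32 from memory into all 16 elements. */
static void broadcast_1to16(const float *mem, float out[16]) {
    assert(mem && out);
    for (int n = 0; n < 16; n++)
        out[n] = mem[0];
}

/* {4to16}: replicate a group of four float32 values into each of the
   four 4-element groups of the 512-bit vector. */
static void broadcast_4to16(const float *mem, float out[16]) {
    assert(mem && out);
    for (int n = 0; n < 16; n++)
        out[n] = mem[n % 4];
}
```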




Register Swizzle: Sf32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override              Usage
000      Round To Nearest (even)             , {rn}
001      Round Down (-INF)                   , {rd}
010      Round Up (+INF)                     , {ru}
011      Round Toward Zero                   , {rz}
100      Round To Nearest (even) with SAE    , {rn-sae}
101      Round Down (-INF) with SAE          , {rd-sae}
110      Round Up (+INF) with SAE            , {ru-sae}
111      Round Toward Zero with SAE          , {rz-sae}
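
The MVEX.EH=0 swizzle patterns can be written as a lookup table of source indices. The sketch below assumes that the {dcba} notation lists the four elements of each group from most significant to least significant, so that a is element 0 of the group; the names swizzle_src and swizzle_f32 are hypothetical.

```c
#include <assert.h>

/* Source index for each destination position within a group of four
   32-bit elements; a is index 0 (least significant), d is index 3.
   Rows follow the S2S1S0 encodings 000..111 of the MVEX.EH=0 table. */
static const int swizzle_src[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba} no swizzle          */
    {1, 0, 3, 2},  /* 001 {cdab} swap (inner) pairs  */
    {2, 3, 0, 1},  /* 010 {badc} swap with two-away  */
    {1, 2, 0, 3},  /* 011 {dacb} cross-product       */
    {0, 0, 0, 0},  /* 100 {aaaa} broadcast a         */
    {1, 1, 1, 1},  /* 101 {bbbb} broadcast b         */
    {2, 2, 2, 2},  /* 110 {cccc} broadcast c         */
    {3, 3, 3, 3},  /* 111 {dddd} broadcast d         */
};

/* Apply the same pattern to each of the four 4-element groups. */
static void swizzle_f32(const float in[16], float out[16], unsigned sss) {
    assert(sss < 8);
    for (int group = 0; group < 4; group++)
        for (int e = 0; e < 4; e++)
            out[4 * group + e] = in[4 * group + swizzle_src[sss][e]];
}
```

The swizzle never moves data between groups, which is why a single 4-entry row per encoding is enough.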


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFNMSUB231PD - Multiply First Source By Second Source, Negate, and Subtract Destination Float64 Vectors


Opcode                           Instruction                                    Description
MVEX.NDS.512.66.0F38.W1 BE /r    vfnmsub231pd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt), negate, and subtract float64 vector zmm1 from the result, and store the final result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm2 and the
float64 vector result of the swizzle/broadcast/conversion process on memory or vector
float64 zmm3, negates the product, and subtracts float64 vector zmm1. The final result
is written into float64 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.15: VFNMSUB outcome when adding zeros depending on rounding-mode


x*y    z     RN/RU/RZ               RD
+0     +0    (-0) + (-0) = -0       (-0) + (-0) = -0
+0     -0    (-0) + (+0) = +0       (-0) + (+0) = -0
-0     +0    (+0) + (-0) = +0       (+0) + (-0) = -0
-0     -0    (+0) + (+0) = +0       (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = (-(zmm2[i+63:i] * tmpSrc3[i+63:i]) - zmm1[i+63:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override              Usage
000      Round To Nearest (even)             , {rn}
001      Round Down (-INF)                   , {rd}
010      Round Up (+INF)                     , {ru}
011      Round Toward Zero                   , {rz}
100      Round To Nearest (even) with SAE    , {rn-sae}
101      Round Down (-INF) with SAE          , {rd-sae}
110      Round Up (+INF) with SAE            , {ru-sae}
111      Round Toward Zero with SAE          , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_fnmsub_pd (__m512d, __m512d, __m512d);
__m512d _mm512_mask_fnmsub_pd (__m512d, __mmask8, __m512d, __m512d);
__m512d _mm512_mask3_fnmsub_pd (__m512d, __m512d, __m512d, __mmask8);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.


VFNMSUB231PS - Multiply First Source By Second Source, Negate, and Subtract Destination Float32 Vectors


Opcode                           Instruction                                    Description
MVEX.NDS.512.66.0F38.W0 BE /r    vfnmsub231ps zmm1 {k1}, zmm2, Sf32(zmm3/mt)    Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt), negate, and subtract float32 vector zmm1 from the result, and store the final result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm2 and the
float32 vector result of the swizzle/broadcast/conversion process on memory or vector
float32 zmm3, negates the product, and subtracts float32 vector zmm1. The final result
is written into float32 vector zmm1.


Intermediate values are calculated to infinite precision, and are not truncated or rounded.
All operations must be performed before the final rounding.


Table 6.16: VFNMSUB outcome when adding zeros depending on rounding-mode


x*y    z     RN/RU/RZ               RD
+0     +0    (-0) + (-0) = -0       (-0) + (-0) = -0
+0     -0    (-0) + (+0) = +0       (-0) + (+0) = -0
-0     +0    (+0) + (-0) = +0       (+0) + (-0) = -0
-0     -0    (+0) + (+0) = +0       (+0) + (+0) = +0


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoadf32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = (-(zmm2[i+31:i] * tmpSrc3[i+31:i]) - zmm1[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override              Usage
000      Round To Nearest (even)             , {rn}
001      Round Down (-INF)                   , {rd}
010      Round Up (+INF)                     , {ru}
011      Round Toward Zero                   , {rz}
100      Round To Nearest (even) with SAE    , {rn-sae}
101      Round Down (-INF) with SAE          , {rd-sae}
110      Round Up (+INF) with SAE            , {ru-sae}
111      Round Toward Zero with SAE          , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_fnmsub_ps (__m512, __m512, __m512);
__m512 _mm512_mask_fnmsub_ps (__m512, __mmask16, __m512, __m512);
__m512 _mm512_mask3_fnmsub_ps (__m512, __m512, __m512, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form.
#GP(0)             If the memory address is in a non-canonical form.
                   If a memory operand linear address is not aligned
                   to the data size granularity dictated by SwizzUpConv
                   mode.
#PF(fault-code)    For a page fault.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGATHERDPD - Gather Float64 Vector With Signed Dword Indices


Opcode                             Instruction                        Description
MVEX.512.66.0F38.W1 92 /r /vsib    vgatherdpd zmm1 {k1}, Uf64(mvt)    Gather float64 vector Uf64(mvt) into float64 vector zmm1 using doubleword indices and k1 as completion mask.


Description 


A set of 8 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to a float64 vector. The result
is written into float64 vector zmm1.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction (via a loop) until
the sequence has fully completed (i.e. all elements of the gather/scatter sequence have
been loaded/stored and hence, the write-mask bits are all zero).


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.
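
The two boundary expressions above are the first and last byte of the 64-byte aligned region that contains the element. A short sketch of that arithmetic follows; the helper names line_start, line_end and element_stays_in_line are hypothetical and only illustrate the masking.

```c
#include <assert.h>
#include <stdint.h>

/* First byte of the 64-byte region containing addr. */
static uint64_t line_start(uint64_t addr) {
    return addr & ~UINT64_C(0x3F);
}

/* Last byte of that region. */
static uint64_t line_end(uint64_t addr) {
    return line_start(addr) + 63;
}

/* Returns 1 if an element of `size` bytes at addr lies inside one region. */
static int element_stays_in_line(uint64_t addr, unsigned size) {
    assert(size > 0);
    return line_start(addr) == line_start(addr + size - 1);
}
```

For an element at 0x1234 the region spans 0x1200 through 0x123F.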


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 


The instruction will #GP fault if the destination vector zmm1 is the same as index vector 
VINDEX. 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)


// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+63:i] = UpConvLoadf64(pointer)
        k1[n] = 0
    }
}
k1[15:8] = 0
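
Because each invocation may complete only a subset of the enabled elements, callers wrap the gather in a loop that runs until the completion mask is zero. The following scalar sketch models that loop. It makes loud assumptions: SELECT_SUBSET is modelled as "only the least significant enabled bit" (the minimum the text guarantees; real hardware may complete more per pass), SCALE is taken to be 8 so that indices address whole doubles, and the names gather_step and gather_all are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One modelled invocation: completes the least significant enabled
   element and returns the updated completion mask. */
static uint8_t gather_step(double zmm1[8], uint8_t k1, const double *base,
                           const int32_t vindex[8]) {
    assert(zmm1 && base && vindex);
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u) {
            zmm1[n] = base[vindex[n]];          /* SCALE = 8 assumed */
            return (uint8_t)(k1 & ~(1u << n));  /* clear the completed bit */
        }
    }
    return k1;  /* nothing enabled, mask unchanged */
}

/* Re-execute until the completion mask is all zero; returns pass count. */
static int gather_all(double zmm1[8], uint8_t k1, const double *base,
                      const int32_t vindex[8]) {
    int passes = 0;
    while (k1 != 0) {
        k1 = gather_step(zmm1, k1, base, vindex);
        passes++;
    }
    return passes;
}
```

Under this worst-case model a mask with four bits set needs four passes; elements whose mask bit was clear on entry are left untouched.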


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: Uf64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    [rax]    8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_i32logather_pd (__m512i, void const*, int);
__m512d _mm512_mask_i32logather_pd (__m512d, __mmask8, __m512i, void const*, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)             If a memory address is in a non-canonical form,
                   and corresponding write-mask bit is not zero.
                   If a memory operand linear address is not aligned
                   to element-wise data granularity dictated by the UpConv mode,
                   and corresponding write-mask bit is not zero.
                   If the destination vector is the same as the index vector [see
#PF(fault-code)    If a memory operand linear address produces a page fault
                   and corresponding write-mask bit is not zero.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   If using a 16 bit effective address.
                   If ModRM.rm is different than 100b.
                   If no write mask is provided or selected write-mask is k0.


VGATHERDPS - Gather Float32 Vector With Signed Dword Indices


Opcode                             Instruction                        Description
MVEX.512.66.0F38.W0 92 /r /vsib    vgatherdps zmm1 {k1}, Uf32(mvt)    Gather float32 vector Uf32(mvt) into float32 vector zmm1 using doubleword indices and k1 as completion mask.


Description 


A set of 16 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to a float32 vector. The result
is written into float32 vector zmm1.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction (via a loop) until
the sequence has fully completed (i.e. all elements of the gather/scatter sequence have
been loaded/stored and hence, the write-mask bits are all zero).


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 
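
In the compressed disp8*N form, the signed 8-bit displacement stored in the instruction is multiplied by N to obtain the byte displacement. For this instruction N is the in-memory element size, which the up-conversion table gives as 4, 2 or 1 bytes. A minimal sketch of the scaling follows; the function name disp8xN is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte displacement for the compressed disp8*N form: the signed 8-bit
   displacement is scaled by N, the element size before up-conversion. */
static int64_t disp8xN(int8_t disp8, unsigned n_bytes) {
    assert(n_bytes == 1 || n_bytes == 2 || n_bytes == 4);
    return (int64_t)disp8 * (int64_t)n_bytes;
}
```

An encoded displacement of 3 therefore means 12 bytes with no conversion (N = 4) but only 3 bytes with a {uint8} up-conversion (N = 1).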


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 


The instruction will #GP fault if the destination vector zmm1 is the same as index vector 
VINDEX. 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)


// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+31:i] = UpConvLoadf32(pointer)
        k1[n] = 0
    }
}


SIMD Floating-Point Exceptions 


Invalid. 


Memory Up-conversion: Uf32


S2S1S0   Function:             Usage              disp8*N
000      no conversion         [rax]              4
001      reserved              N/A                N/A
010      reserved              N/A                N/A
011      float16 to float32    [rax] {float16}    2
100      uint8 to float32      [rax] {uint8}      1
101      sint8 to float32      [rax] {sint8}      1
110      uint16 to float32     [rax] {uint16}     2
111      sint16 to float32     [rax] {sint16}     2
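
The integer rows of the table above amount to reading a narrow element and converting it to float32. The sketch below models one element for the no-conversion and integer encodings only; the float16 encoding (011) is deliberately left out because it needs a half-precision decoder. The names upconv_f32 and upconv_load_f32 are hypothetical and not part of any Intel header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values follow the S2S1S0 column of the Uf32 table. */
enum upconv_f32 {
    UPCONV_NONE = 0, UPCONV_UINT8 = 4, UPCONV_SINT8 = 5,
    UPCONV_UINT16 = 6, UPCONV_SINT16 = 7
};

/* Hypothetical scalar model of UpConvLoadf32 for one element. */
static float upconv_load_f32(const void *p, enum upconv_f32 mode) {
    float f; uint8_t u8; int8_t s8; uint16_t u16; int16_t s16;
    assert(p);
    switch (mode) {
    case UPCONV_NONE:   memcpy(&f, p, sizeof f);     return f;
    case UPCONV_UINT8:  memcpy(&u8, p, sizeof u8);   return (float)u8;
    case UPCONV_SINT8:  memcpy(&s8, p, sizeof s8);   return (float)s8;
    case UPCONV_UINT16: memcpy(&u16, p, sizeof u16); return (float)u16;
    case UPCONV_SINT16: memcpy(&s16, p, sizeof s16); return (float)s16;
    }
    assert(!"reserved or unmodelled encoding");
    return 0.0f;
}
```

The number of bytes each case reads (4, 1, 1, 2, 2) matches the disp8*N column for the same encoding.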


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_i32gather_ps (__m512i, void const*, int);
__m512 _mm512_mask_i32gather_ps (__m512, __mmask16, __m512i, void const*, int);
__m512 _mm512_i32extgather_ps (__m512i, void const*, _MM_UPCONV_PS_ENUM, int, int);
__m512 _mm512_mask_i32extgather_ps (__m512, __mmask16, __m512i, void const*,
_MM_UPCONV_PS_ENUM, int, int);


Exceptions 


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#SS(0)             If a memory address referencing the SS segment is
                   in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)             If a memory address is in a non-canonical form,
                   and corresponding write-mask bit is not zero.
                   If a memory operand linear address is not aligned
                   to element-wise data granularity dictated by the UpConv mode,
                   and corresponding write-mask bit is not zero.
                   If the destination vector is the same as the index vector [see
#PF(fault-code)    If a memory operand linear address produces a page fault
                   and corresponding write-mask bit is not zero.
#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   If using a 16 bit effective address.
                   If ModRM.rm is different than 100b.
                   If no write mask is provided or selected write-mask is k0.


VGATHERPF0DPS - Gather Prefetch Float32 Vector With Signed Dword Indices Into L1


Opcode                             Instruction                     Description
MVEX.512.66.0F38.W0 C6 /1 /vsib    vgatherpf0dps Uf32(mvt) {k1}    Gather Prefetch float32 vector Uf32(mvt), using doubleword indices with T0 hint, under write-mask.
Description 


A set of 16 float32 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are prefetched from memory to the L1 level
of cache. If any memory access causes any type of memory exception, the memory access
will be considered as completed (destination mask updated) and the exception ignored.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter prefetch instruction (via a loop)
until the sequence has fully completed (i.e. all elements of the prefetch sequence have
been prefetched and hence, the write-mask bits are all zero).


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 


Note that both gather and scatter prefetches set the access bit (A) in the related TLB page 
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D). 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)


exclusive = 0
evicthintpre = MVEX.EH

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL1cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}
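
The address computed for each element in the pseudocode comment can be sketched in C. One assumption is labeled loudly here: the index is sign-extended to 64 bits before it is multiplied by SCALE. The pseudocode writes SignExtend(VINDEX * SCALE), which does not say where the widening happens, so treat the ordering below as an interpretation. The function name element_address is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Element address: BASE_ADDR plus the signed doubleword index times SCALE.
   Assumption: the index is widened to 64 bits before scaling. */
static uint64_t element_address(uint64_t base, int32_t index, unsigned scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + (uint64_t)((int64_t)index * (int64_t)scale);
}
```

Negative indices address memory below BASE_ADDR, for example index -1 with SCALE 4 gives BASE_ADDR - 4.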


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: Uf32


S2S1S0   Function:             Usage              disp8*N
000      no conversion         [rax]              4
001      reserved              N/A                N/A
010      reserved              N/A                N/A
011      float16 to float32    [rax] {float16}    2
100      uint8 to float32      [rax] {uint8}      1
101      sint8 to float32      [rax] {sint8}      1
110      uint16 to float32     [rax] {uint16}     2
111      sint16 to float32     [rax] {sint16}     2


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_prefetch_i32gather_ps (__m512i, void const*, int, int);
void _mm512_mask_prefetch_i32gather_ps (__m512i, __mmask16, void const*, int, int);
void _mm512_prefetch_i32extgather_ps (__m512i, void const*,
_MM_UPCONV_PS_ENUM, int, int);
void _mm512_mask_prefetch_i32extgather_ps (__m512i, __mmask16, void const*,
_MM_UPCONV_PS_ENUM, int, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#NM                If CR0.TS[bit 3]=1.
#UD                If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   If using a 16 bit effective address.
                   If ModRM.rm is different than 100b.
                   If no write mask is provided or selected write-mask is k0.


VGATHERPF0HINTDPD - Gather Prefetch Float64 Vector Hint With Signed Dword Indices


Opcode                             Instruction                         Description
MVEX.512.66.0F38.W1 C6 /0 /vsib    vgatherpf0hintdpd Uf64(mvt) {k1}    Gather Prefetch float64 vector Uf64(mvt), using doubleword indices with T0 hint, under write-mask.
Description 


The instruction specifies a set of 8 float64 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real gather instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real gather instruction to improve its
performance.


This instruction is a hint and may be speculative, and may be dropped or specify invalid
addresses without causing problems or memory related faults. This instruction does not
modify any kind of architectural state (including the write-mask).


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Operation 


// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        HintPointer(pointer)
    }
}


SIMD Floating-Point Exceptions


None.


Memory Up-conversion: Uf64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    [rax]    8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A


Intel® C/C++ Compiler Intrinsic Equivalent


None


Exceptions


Real-Address Mode and Virtual-8086

#UD                Instruction not available in these modes

Protected and Compatibility Mode

#UD                Instruction not available in these modes

64 bit Mode

#NM                If CR0.TS[bit 3]=1.
#UD                If processor model does not implement the specific instruction.
                   If preceded by any REX, F0, F2, F3, or 66 prefixes.
                   If using a 16 bit effective address.
                   If ModRM.rm is different than 100b.
                   If no write mask is provided or selected write-mask is k0.


VGATHERPF0HINTDPS - Gather Prefetch Float32 Vector Hint With Signed Dword Indices


Opcode                             Instruction                         Description
MVEX.512.66.0F38.W0 C6 /0 /vsib    vgatherpf0hintdps Uf32(mvt) {k1}    Gather Prefetch float32 vector Uf32(mvt), using doubleword indices with T0 hint, under write-mask.
Description 


The instruction specifies a set of 16 float32 memory locations pointed to by base address
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance
hint that a real gather instruction with the same set of sources will be invoked. A
programmer may execute this instruction before a real gather instruction to improve its
performance.
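
As an illustration only, the address set described above can be sketched in C. The helper name gather_addresses and its array-based interface are hypothetical and not part of the instruction set; the sketch sign-extends each 32-bit index to 64 bits before scaling, which is an assumption about how SignExtend(VINDEX * SCALE) is meant to be read.

```c
#include <stdint.h>

/* Hypothetical helper: lists the addresses that a 16-element,
 * doubleword-index gather (or its prefetch hint) would touch.
 * Only elements whose bit is set in k1 are considered. */
static int gather_addresses(uint64_t base_addr, const int32_t vindex[16],
                            unsigned scale, uint16_t k1, uint64_t out[16])
{
    int count = 0;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            /* Assumption: sign-extend the 32-bit index, then scale it. */
            int64_t offset = (int64_t)vindex[n] * (int64_t)scale;
            out[count++] = base_addr + (uint64_t)offset;
        }
    }
    return count; /* number of addresses written to out[] */
}
```

Negative indices yield addresses below BASE_ADDR, which is why the index is treated as signed.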


This instruction is a hint and may be speculative; it may be dropped or may specify invalid
addresses without causing problems or memory-related faults. This instruction does not
modify any kind of architectural state (including the write-mask).


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Operation 


// Use mv_t as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mv_t[n]
        HintPointer(pointer)
    }
}


SIMD Floating-Point Exceptions 


None. 
Memory Up-conversion: U_f32


S2S1S0   Function:             Usage              disp8*N
000      no conversion         [rax]              4
001      reserved              N/A                N/A
010      reserved              N/A                N/A
011      float16 to float32    [rax] {float16}    2
100      uint8 to float32      [rax] {uint8}      1
101      sint8 to float32      [rax] {sint8}      1
110      uint16 to float32     [rax] {uint16}     2
111      sint16 to float32     [rax] {sint16}     2


Intel® C/C++ Compiler Intrinsic Equivalent


None 


Exceptions 


Real-Address Mode and Virtual-8086

#UD   Instruction not available in these modes

Protected and Compatibility Mode

#UD   Instruction not available in these modes

64 bit Mode

#NM   If CR0.TS[bit 3]=1.
#UD   If processor model does not implement the specific instruction.
      If preceded by any REX, F0, F2, F3, or 66 prefixes.
      If using a 16 bit effective address.
      If ModRM.rm is different than 100b.
      If no write mask is provided or selected write-mask is k0.


VGATHERPF1DPS - Gather Prefetch Float32 Vector With Signed Dword
Indices Into L2


Opcode        MVEX.512.66.0F38.W0 C6 /2 /vsib
Instruction   vgatherpf1dps U_f32(mv_t) {k1}
Description   Gather Prefetch float32 vector U_f32(mv_t), using doubleword
              indices with T1 hint, under write-mask.

Description 


A set of 16 float32 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are prefetched from memory to the L2 level
of cache. If any memory access causes any type of memory exception, the memory access
will be considered as completed (destination mask updated) and the exception ignored.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction (via a loop) until
the full sequence completes (i.e. all elements of the prefetch sequence have been
prefetched and hence all the write-mask bits are zero).


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.
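
The 64-byte region above can be computed with a mask, as in this small C sketch. The function names line_first_byte and line_last_byte are illustrative only.

```c
#include <stdint.h>

/* First byte of the 64-byte region containing addr: clear the low 6 bits. */
static uint64_t line_first_byte(uint64_t addr)
{
    return addr & ~UINT64_C(0x3F);
}

/* Last byte of the same 64-byte region. */
static uint64_t line_last_byte(uint64_t addr)
{
    return (addr & ~UINT64_C(0x3F)) + 63;
}
```

Two elements whose addresses fall between the same pair of boundaries touch the same 64-byte region.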


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 
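
The re-trigger pattern can be modeled in plain C without any hardware. This is a simulation sketch only: select_subset is a hypothetical stand-in for SELECT_SUBSET that services at most max_elems elements per call starting from the least significant enabled bit, and the real subset chosen by hardware is not specified beyond the two guarantees listed above.

```c
#include <stdint.h>

/* Hypothetical stand-in for SELECT_SUBSET: picks up to max_elems enabled
 * bits starting at the least significant one. At least one enabled bit is
 * always selected when k1 is non-zero, matching guarantee (b). */
static uint16_t select_subset(uint16_t k1, int max_elems)
{
    uint16_t subset = 0;
    if (max_elems < 1)
        max_elems = 1;
    for (int n = 0; n < 16 && max_elems > 0; n++) {
        if ((k1 >> n) & 1u) {
            subset |= (uint16_t)(1u << n);
            max_elems--;
        }
    }
    return subset;
}

/* One simulated invocation: clears the serviced bits in *k1. */
static void prefetch_step(uint16_t *k1, int max_elems)
{
    uint16_t ktemp = select_subset(*k1, max_elems);
    *k1 &= (uint16_t)~ktemp;
}

/* Re-executes the simulated instruction until the mask is empty and
 * returns how many invocations were needed. */
static int prefetch_all(uint16_t k1, int max_elems)
{
    int iterations = 0;
    while (k1 != 0) {
        prefetch_step(&k1, max_elems);
        iterations++;
    }
    return iterations;
}
```

Because every invocation clears at least one mask bit, the loop finishes in at most 16 iterations for a 16-element mask.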


Note that both gather and scatter prefetches set the access bit (A) in the related TLB page 
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D). 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)

exclusive = 0
evicthintpre = MVEX.EH

// Use mv_t as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mv_t[n]
        FetchL2cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: U_f32


S2S1S0   Function:             Usage              disp8*N
000      no conversion         [rax]              4
001      reserved              N/A                N/A
010      reserved              N/A                N/A
011      float16 to float32    [rax] {float16}    2
100      uint8 to float32      [rax] {uint8}      1
101      sint8 to float32      [rax] {sint8}      1
110      uint16 to float32     [rax] {uint16}     2
111      sint16 to float32     [rax] {sint16}     2
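
The disp8*N column can be read as a scale factor applied to the signed 8-bit displacement. The C sketch below copies the N values from the U_f32 table; the function name scaled_disp and the use of 0 to mark reserved encodings are conventions of this example only.

```c
#include <stdint.h>

/* N for each S2S1S0 encoding of the U_f32 table; 0 marks a reserved entry. */
static const int uf32_disp8_n[8] = { 4, 0, 0, 2, 1, 1, 2, 2 };

/* Computes the byte displacement disp8 * N for a given up-conversion.
 * Returns 0 on success and -1 for a reserved encoding. */
static int scaled_disp(unsigned sss, int8_t disp8, int64_t *out)
{
    int n = uf32_disp8_n[sss & 7u];
    if (n == 0)
        return -1;
    *out = (int64_t)disp8 * n;
    return 0;
}
```

With no conversion, a disp8 of 16 reaches 64 bytes past the base; with float16 up-conversion the same disp8 reaches only 32 bytes.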


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_prefetch_i32gather_ps (__m512i, void const*, int, int);

void _mm512_mask_prefetch_i32gather_ps (__m512i, __mmask16, void const*, int,
int);

void _mm512_prefetch_i32extgather_ps (__m512i, void const*,
_MM_UPCONV_PS_ENUM, int, int);

void _mm512_mask_prefetch_i32extgather_ps (__m512i, __mmask16, void const*,
_MM_UPCONV_PS_ENUM, int, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD   Instruction not available in these modes

Protected and Compatibility Mode

#UD   Instruction not available in these modes

64 bit Mode

#NM   If CR0.TS[bit 3]=1.
#UD   If preceded by any REX, F0, F2, F3, or 66 prefixes.
      If using a 16 bit effective address.
      If ModRM.rm is different than 100b.
      If no write mask is provided or selected write-mask is k0.


VGETEXPPD - Extract Float64 Vector of Exponents from Float64 Vector


Opcode        MVEX.512.66.0F38.W1 42 /r
Instruction   vgetexppd zmm1 {k1}, S_f64(zmm2/m_t)
Description   Extract float64 vector of exponents from vector S_f64(zmm2/m_t)
              and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element exponent extraction from the Float64 vector result of 
the swizzle/broadcast/conversion process on memory or Float64 vector zmm2. The re- 
sult is written into Float64 vector zmm1. 


GetExp() returns the (un-biased) exponent n in floating-point format. That is, when X =
1/16, GetExp() returns the value -4, represented as C0800000 in IEEE single precision
(for the single-precision version of the instruction). If the source is denormal, VGETEXP
will normalize it prior to exponent extraction (unless DAZ=1).


The GetExp() function follows Table 6.17 when dealing with floating-point special numbers.


Input    Result
NaN      quietized input NaN
+∞       +∞
+0       -∞
-0       -∞
-∞       +∞


Table 6.17: GetExp() special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
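
The write-mask rule can be illustrated with a short C sketch over an 8-element float64 vector. The names masked_map8 and twice are hypothetical; any per-element operation could take the place of twice.

```c
#include <stdint.h>

/* Applies op to the elements of src selected by k1 and stores the results
 * in zmm1. Elements whose mask bit is clear keep their previous value. */
static void masked_map8(double zmm1[8], const double src[8], uint8_t k1,
                        double (*op)(double))
{
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u)
            zmm1[n] = op(src[n]);
    }
}

/* Example per-element operation (illustrative only). */
static double twice(double x)
{
    return 2.0 * x;
}
```

Calling masked_map8 with k1 = 0x05 updates only elements 0 and 2 of the destination.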


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = GetExp(tmpSrc2[i+63:i])
    }
}


SIMD Floating-Point Exceptions


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: S_f64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: S_f64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}
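
The swizzle patterns can be modeled in C as a permutation applied to each group of four elements. This sketch assumes that the four letters list source elements from the most significant position (left) to the least significant (right), with a naming element 0 of the group; that reading and the function name swizzle4 are assumptions of this example.

```c
#include <stddef.h>

/* Applies a four-letter swizzle pattern such as "cdab" to every group of
 * four elements of src, writing the result to dst (dst must not alias src).
 * Returns 0 on success, -1 if count is not a multiple of 4 or the pattern
 * contains a letter outside a..d. */
static int swizzle4(const double *src, double *dst, size_t count,
                    const char *pattern)
{
    if (count % 4 != 0)
        return -1;
    for (size_t lane = 0; lane < count; lane += 4) {
        for (int p = 0; p < 4; p++) {
            /* The rightmost letter describes destination element 0. */
            int from = pattern[3 - p] - 'a';
            if (from < 0 || from > 3)
                return -1;
            dst[lane + p] = src[lane + from];
        }
    }
    return 0;
}
```

Under that assumption, "dcba" is the identity and "aaaa" broadcasts element 0 of each group to all four positions.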


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_getexp_pd (__m512d);
__m512d _mm512_mask_getexp_pd (__m512d, __mmask8, __m512d);


Exceptions 


Real-Address Mode and Virtual-8086

#UD               Instruction not available in these modes

Protected and Compatibility Mode

#UD               Instruction not available in these modes

64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGETEXPPS - Extract Float32 Vector of Exponents from Float32 Vector


Opcode        MVEX.512.66.0F38.W0 42 /r
Instruction   vgetexpps zmm1 {k1}, S_f32(zmm2/m_t)
Description   Extract float32 vector of exponents from vector S_f32(zmm2/m_t)
              and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element exponent extraction from the Float32 vector result of
the swizzle/broadcast/conversion process on memory or Float32 vector zmm2. The result
is written into Float32 vector zmm1.


GetExp() returns the (un-biased) exponent n in floating-point format. That is, when X =
1/16, GetExp() returns the value -4, represented as C0800000 in IEEE single precision
(for the single-precision version of the instruction). If the source is denormal, VGETEXP
will normalize it prior to exponent extraction (unless DAZ=1).


The GetExp() function follows Table 6.18 when dealing with floating-point special numbers.


Input    Result
NaN      quietized input NaN
+∞       +∞
+0       -∞
-0       -∞
-∞       +∞


Table 6.18: GetExp() special floating-point values behavior 
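
A scalar model of GetExp() for one float32 element can be written in portable C by working on the bit pattern. This is a sketch under stated assumptions, not the hardware definition: the function name getexp_f32 and the daz parameter are illustrative, exception flags are not modeled, and a NaN input is quietized by setting the top fraction bit.

```c
#include <stdint.h>
#include <string.h>

static uint32_t f32_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_f32(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Scalar sketch of GetExp() following the special-value table above.
 * daz != 0 models DAZ=1 (denormal inputs are treated as zero). */
static float getexp_f32(float x, int daz)
{
    uint32_t u = f32_bits(x);
    uint32_t exp = (u >> 23) & 0xFFu;
    uint32_t frac = u & 0x7FFFFFu;

    if (exp == 0xFFu) {
        if (frac != 0)
            return bits_f32(u | 0x00400000u);   /* quietized input NaN */
        return bits_f32(0x7F800000u);           /* +inf or -inf gives +inf */
    }
    if (exp == 0 && (frac == 0 || daz))
        return bits_f32(0xFF800000u);           /* +0 or -0 gives -inf */

    int e = (int)exp - 127;                     /* remove the bias */
    if (exp == 0) {
        /* Denormal: shift until the hidden bit appears, adjusting e. */
        e = -126;
        while ((frac & 0x00800000u) == 0) {
            frac <<= 1;
            e--;
        }
    }
    return (float)e;
}
```

For X = 1/16 the sketch returns -4.0f, whose bit pattern is C0800000 as stated in the text.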


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = GetExp(tmpSrc2[i+31:i])
    }
}


SIMD Floating-Point Exceptions


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_getexp_ps (__m512);
__m512 _mm512_mask_getexp_ps (__m512, __mmask16, __m512);


Exceptions 


Real-Address Mode and Virtual-8086

#UD               Instruction not available in these modes

Protected and Compatibility Mode

#UD               Instruction not available in these modes

64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGETMANTPD - Extract Float64 Vector of Normalized Mantissas from
Float64 Vector


Opcode        MVEX.512.66.0F3A.W1 26 /r ib
Instruction   vgetmantpd zmm1 {k1}, S_f64(zmm2/m_t), imm8
Description   Get Normalized Mantissa from float64 vector S_f64(zmm2/m_t)
              and store the result in zmm1, using imm8 for sign control and
              mantissa interval normalization, under write-mask.

Description 


Performs an element-by-element conversion of the Float64 vector result of the swiz- 
zle/broadcast/conversion process on memory or Float64 vector zmm2 to Float64 values 
with the mantissa normalized to the interval specified by interv and sign dictated by the 
sign control parameter sc. The result is written into Float64 vector zmm1. Denormal val- 
ues are explicitly normalized. 


The formula for the operation is:


$$\mathrm{GetMant}(x) = \pm 2^{k}\,|x.\mathrm{significand}|$$


where:


$$1 \le |x.\mathrm{significand}| < 2$$


Exponent k is dependent on the interval range defined by interv and whether the exponent
of the source is even or odd. The sign of the final result is determined by sc and the
source sign.


GetMant() function follows Table 6.19 when dealing with floating-point special numbers. 


Input    Result                    Exceptions/comments
NaN      QNaN(SRC)                 Raises #I if sNaN
+∞       +∞                        ignore interv
+0       +0.0                      ignore interv
-0       (SC[0])? +0.0 : -0.0      ignore interv, set NaN/raise #I if SC[1]=1
-∞       (SC[0])? +∞ : -∞          ignore interv, set NaN/raise #I if SC[1]=1
<0                                 set NaN/raise #I if SC[1]=1


Table 6.19: GetMant() special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format


Normalization Interval             I1   I0
[1,2)                              0    0
[1/2,2)                            0    1
[1/2,1)                            1    0
[3/4,3/2)                          1    1

Sign Control                       I3   I2
sign = sign(SRC)                   0    0
sign = 0                           0    1
DEST = NaN (#I) if sign(SRC)=1     1    x
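
To make the immediate encoding concrete, here is a scalar C sketch of the float64 mantissa extraction that decodes imm8 as interv = imm8[1:0] and sc = imm8[3:2]. The function name getmant_f64 and the daz parameter are illustrative, the exception flags (#I, #D) are not modeled, and the sketch should be read as an approximation of the documented behavior rather than a reference implementation.

```c
#include <stdint.h>
#include <string.h>

static uint64_t f64_bits(double x) { uint64_t u; memcpy(&u, &x, sizeof u); return u; }
static double bits_f64(uint64_t u) { double x; memcpy(&x, &u, sizeof x); return x; }

#define FRACT_MASK UINT64_C(0x000FFFFFFFFFFFFF)

/* Scalar sketch of the normalized-mantissa extraction for one float64. */
static double getmant_f64(double src, unsigned imm8, int daz)
{
    unsigned interv = imm8 & 3u;          /* normalization interval */
    unsigned sc = (imm8 >> 2) & 3u;       /* sign control */
    uint64_t u = f64_bits(src);
    uint64_t src_sign = u >> 63;
    uint64_t sign = (sc & 1u) ? 0 : src_sign;
    uint64_t exp = (u >> 52) & 0x7FF;
    uint64_t fract = u & FRACT_MASK;

    if (daz && exp == 0)
        fract = 0;                        /* denormal treated as zero */

    if (exp == 0x7FF && fract != 0)       /* NaN: return quietized input */
        return bits_f64(u | (UINT64_C(1) << 51));
    if ((sc & 2u) && src_sign)            /* negative input with SC[1]=1 */
        return bits_f64(UINT64_C(0xFFF8000000000000));
    if ((exp == 0x7FF || exp == 0) && fract == 0)   /* infinity or zero */
        return bits_f64((sign << 63) | (exp << 52));

    if (exp == 0) {                       /* denormal: normalize fraction */
        uint64_t jbit = 0;
        exp = 0x3FF;
        while (jbit == 0) {
            jbit = (fract >> 51) & 1u;
            fract = (fract << 1) & FRACT_MASK;
            exp--;
        }
    }

    int odd = (int)((exp - 0x3FF) & 1u);  /* parity of unbiased exponent */
    if (interv == 2u || (interv == 1u && odd) ||
        (interv == 3u && ((fract >> 51) & 1u)))
        exp = 0x3FE;                      /* unbiased exponent -1 */
    else
        exp = 0x3FF;                      /* unbiased exponent 0 */

    return bits_f64((sign << 63) | (exp << 52) | fract);
}
```

For example, 12.0 = 1.5 * 2^3 yields 1.5 with interval [1,2) and 0.75 with interval [1/2,1) or [3/4,3/2).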


Operation 


GetNormalizedMantissa(SRC, SignCtrl, Interv)
{
    // Extracting the SRC sign, exponent and mantissa fields
    SIGN = (SignCtrl[0])? 0 : SRC[63];
    EXP = SRC[62:52];
    FRACT = (DAZ && (EXP == 0))? 0 : SRC[51:0];

    // Check for NaN operand
    if(IsNaN(SRC)) {
        if(IsSNaN(SRC)) *set I flag*
        return QNaN(SRC)
    }

    // If SignCtrl[1] is set to 1, return NaN and set
    // exception flag if the operand is negative.
    // Note that -0.0 is included
    if( SignCtrl[1] && (SRC[63] == 1) ) {
        *set I flag*
        return QNaN_Indefinite
    }

    // Check for +/-INF and +/-0
    if( ( EXP == 0x7FF && FRACT == 0 )
     || ( EXP == 0 && FRACT == 0 ) ) {
        DEST[63:0] = (SIGN << 63) | (EXP[11:0] << 52) | FRACT[51:0];
        return DEST
    }

    // Normalize denormal operands
    // note that denormal operands are treated as zero if
    // DAZ is set to 1
    if((EXP == 0) && (FRACT != 0)) {
        // JBIT is the hidden integral bit
        JBIT = 0;       // Zero in case of denormal operands
        EXP = 03FFh;    // Set exponent to BIAS
        While(JBIT == 0) {
            JBIT = FRACT[51];     // Obtain fraction MSB
            FRACT = FRACT << 1;   // Normalize mantissa
            EXP--;                // and adjust exponent
        }
        *set D flag*
    }

    // Apply normalization intervals
    UNBIASED_EXP = EXP - 03FFh;     // get exponent in unbiased form
    IS_ODD_EXP = UNBIASED_EXP[0];   // is the unbiased exponent odd?

    if( (Interv == 10b)
     || ( (Interv == 01b) && IS_ODD_EXP)
     || ( (Interv == 11b) && (FRACT[51]==1)) ) {
        EXP = 03FEh;    // Set exponent to -1 (unbiased)
    }
    else {
        EXP = 03FFh;    // Set exponent to 0 (unbiased)
    }

    // form the final destination
    DEST[63:0] = (SIGN << 63) | (EXP[11:0] << 52) | FRACT[51:0];

    return DEST
}


sc = IMM8[3:2]
interv = IMM8[1:0]

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = GetNormalizedMantissa(tmpSrc2[i+63:i], sc, interv)
    }
}


SIMD Floating-Point Exceptions


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: S_f64


S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: S_f64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_getmant_pd (__m512d, _MM_MANTISSA_NORM_ENUM,
_MM_MANTISSA_SIGN_ENUM);
__m512d _mm512_mask_getmant_pd (__m512d, __mmask8, __m512d,
_MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);


Exceptions 


Real-Address Mode and Virtual-8086

#UD               Instruction not available in these modes

Protected and Compatibility Mode

#UD               Instruction not available in these modes

64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGETMANTPS - Extract Float32 Vector of Normalized Mantissas from
Float32 Vector


Opcode        MVEX.512.66.0F3A.W0 26 /r ib
Instruction   vgetmantps zmm1 {k1}, S_f32(zmm2/m_t), imm8
Description   Get Normalized Mantissa from float32 vector S_f32(zmm2/m_t)
              and store the result in zmm1, using imm8 for sign control and
              mantissa interval normalization, under write-mask.


Description 


Performs an element-by-element conversion of the Float32 vector result of the swiz- 
zle/broadcast/conversion process on memory or Float32 vector zmm2 to Float32 values 
with the mantissa normalized to the interval specified by interv and sign dictated by the 
sign control parameter sc. The result is written into Float32 vector zmm1. Denormal val- 
ues are explicitly normalized. 


The formula for the operation is:


$$\mathrm{GetMant}(x) = \pm 2^{k}\,|x.\mathrm{significand}|$$


where:


$$1 \le |x.\mathrm{significand}| < 2$$


Exponent k is dependent on the interval range defined by interv and whether the expo- 
nent of the source is even or odd. The sign of the final result is determined by sc and the 
source sign. 


GetMant() function follows Table 6.20 when dealing with floating-point special numbers. 


Input    Result                    Exceptions/comments
NaN      QNaN(SRC)                 Raises #I if sNaN
+∞       +∞                        ignore interv
+0       +0.0                      ignore interv
-0       (SC[0])? +0.0 : -0.0      ignore interv, set NaN/raise #I if SC[1]=1
-∞       (SC[0])? +∞ : -∞          ignore interv, set NaN/raise #I if SC[1]=1
<0                                 set NaN/raise #I if SC[1]=1


Table 6.20: GetMant() special floating-point values behavior


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Immediate Format


Normalization Interval             I1   I0
[1,2)                              0    0
[1/2,2)                            0    1
[1/2,1)                            1    0
[3/4,3/2)                          1    1

Sign Control                       I3   I2
sign = sign(SRC)                   0    0
sign = 0                           0    1
DEST = NaN (#I) if sign(SRC)=1     1    x


Operation 


GetNormalizedMantissa(SRC, SignCtrl, Interv)
{
    // Extracting the SRC sign, exponent and mantissa fields
    SIGN = (SignCtrl[0])? 0 : SRC[31];
    EXP = SRC[30:23];
    FRACT = (DAZ && (EXP == 0))? 0 : SRC[22:0];

    // Check for NaN operand
    if(IsNaN(SRC)) {
        if(IsSNaN(SRC)) *set I flag*
        return QNaN(SRC)
    }

    // If SignCtrl[1] is set to 1, return NaN and set
    // exception flag if the operand is negative.
    // Note that -0.0 is included
    if( SignCtrl[1] && (SRC[31] == 1) ) {
        *set I flag*
        return QNaN_Indefinite
    }

    // Check for +/-INF and +/-0
    if( ( EXP == 0xFF && FRACT == 0 )
     || ( EXP == 0 && FRACT == 0 ) ) {
        DEST[31:0] = (SIGN << 31) | (EXP[7:0] << 23) | FRACT[22:0];
        return DEST
    }

    // Apply normalization intervals
    UNBIASED_EXP = EXP - 07Fh;      // get exponent in unbiased form
    IS_ODD_EXP = UNBIASED_EXP[0];   // is the unbiased exponent odd?

    if( (Interv == 10b)
     || ( (Interv == 01b) && IS_ODD_EXP)
     || ( (Interv == 11b) && (FRACT[22]==1)) ) {
        EXP = 07Eh;     // Set exponent to -1 (unbiased)
    }
    else {
        EXP = 07Fh;     // Set exponent to 0 (unbiased)
    }

    // form the final destination
    DEST[31:0] = (SIGN << 31) | (EXP[7:0] << 23) | FRACT[22:0];

    return DEST
}


sc = IMM8[3:2]
interv = IMM8[1:0]

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = GetNormalizedMantissa(tmpSrc2[i+31:i], sc, interv)
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
Not Applicable 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_getmant_ps (__m512, _MM_MANTISSA_NORM_ENUM,
_MM_MANTISSA_SIGN_ENUM);
__m512 _mm512_mask_getmant_ps (__m512, __mmask16, __m512,
_MM_MANTISSA_NORM_ENUM, _MM_MANTISSA_SIGN_ENUM);


Exceptions


Real-Address Mode and Virtual-8086

#UD               Instruction not available in these modes

Protected and Compatibility Mode

#UD               Instruction not available in these modes

64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGMAXABSPS - Absolute Maximum of Float32 Vectors


Opcode        MVEX.NDS.512.66.0F38.W0 51 /r
Instruction   vgmaxabsps zmm1 {k1}, zmm2, S_f32(zmm3/m_t)
Description   Determine the maximum of the absolute values of float32 vector
              zmm2 and float32 vector S_f32(zmm3/m_t) and store the result
              in zmm1, under write-mask.


Description 


Determines the maximum of the absolute values of each pair of corresponding elements 
in float32 vector zmm2 and the float32 vector result of the swizzle/broadcast/conversion 
process on memory or float32 vector zmm3. The result is written into float32 vector 
zmm1. 


Abs() returns the absolute value of one float32 argument. FpMax() returns the bigger
of the two float32 arguments, following IEEE in general. NaN has special handling: if
one source operand is NaN, then the other source operand is returned (choice made
per-component). If both are NaN, then the unchanged NaN from the first source (here zmm2)
is returned. Please note that if the first source is an SNaN it won't be quietized; it
will be returned without any modification. This differs from the new IEEE 754-08 rules,
which state that in case of an input SNaN, its quietized version should be returned
instead of the other value.


Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the 
sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recom- 
mends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the 
result of comparing zeros to be dependent on the order of parameters, using a comparison 
that ignores the signs. 


This instruction treats input denormals as zeros according to the DAZ control bit, but it 
does not flush tiny results to zero. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


FpMaxAbs(A,B)
{
    if ((A == NaN) && (B == NaN))
        return Abs(A);
    else if (A == NaN)
        return Abs(B);
    else if (B == NaN)
        return Abs(A);
    else if ((Abs(A) == +inf) || (Abs(B) == +inf))
        return +inf;
    else if (Abs(A) >= Abs(B))
        return Abs(A);
    else
        return Abs(B);
}


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMaxAbs(zmm2[i+31:i], tmpSrc3[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
NO 


Memory Up-conversion: S_f32


S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: S_f32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_gmaxabs_ps (__m512, __m512);
__m512 _mm512_mask_gmaxabs_ps (__m512, __mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086

#UD               Instruction not available in these modes

Protected and Compatibility Mode

#UD               Instruction not available in these modes

64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.


VGMAXPD - Maximum of Float64 Vectors


Opcode        MVEX.NDS.512.66.0F38.W1 53 /r
Instruction   vgmaxpd zmm1 {k1}, zmm2, S_f64(zmm3/m_t)
Description   Determine the maximum of float64 vector zmm2 and float64 vector
              S_f64(zmm3/m_t) and store the result in zmm1, under write-mask.


Description 


Determines the maximum value of each pair of corresponding elements in float64 vec- 
tor zmm2 and the float64 vector result of the swizzle/broadcast/conversion process on 
memory or float64 vector zmm3. The result is written into float64 vector zmm1. 


FpMax() returns the bigger of the two float64 arguments, following IEEE in general. NaN
has special handling: if one source operand is NaN, then the other source operand is
returned (choice made per-component). If both are NaN, then the unchanged NaN from the
first source (here zmm2) is returned. Please note that if the first source is an SNaN it
won't be quietized; it will be returned without any modification. This differs from the
new IEEE 754-08 rules, which state that in case of an input SNaN, its quietized version
should be returned instead of the other value.


Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the 
sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recom- 
mends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the 
result of comparing zeros to be dependent on the order of parameters, using a comparison 
that ignores the signs. 
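
The NaN and signed-zero rules described above can be sketched for one float64 pair in C. This is a hedged model, not the hardware definition: the name fp_max is illustrative, exception flags are not modeled, and the sketch assumes the instruction returns +0 for a mixed-sign pair of zeros, in line with the IEEE 754-08 rule quoted above.

```c
#include <stdint.h>
#include <string.h>

static int is_nan(double x) { return x != x; }

static int sign_bit(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return (int)(u >> 63);
}

/* Scalar sketch of the maximum with the NaN handling described above. */
static double fp_max(double a, double b)
{
    /* Assumption: a mixed-sign pair of zeros yields +0. */
    if (a == 0.0 && b == 0.0 && sign_bit(a) != sign_bit(b))
        return sign_bit(a) ? b : a;
    if (is_nan(a) && is_nan(b))
        return a;               /* unchanged NaN from the first source */
    if (is_nan(a))
        return b;               /* the other operand is returned */
    if (is_nan(b))
        return a;
    return (a >= b) ? a : b;    /* also covers +inf and -inf */
}
```

Note that a single NaN operand never propagates here; the non-NaN operand wins regardless of its value.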


This instruction treats input denormals as zeros according to the DAZ control bit, but it 
does not flush tiny results to zero. 


The following table describes exception flags priority: 


Input 1     Input 2     Flags    Comments
SNAN        denormal    #I       #I priority over #D
denormal    SNAN        #I       #I priority over #D
QNAN        denormal    none     QNaN rule priority over #D
denormal    QNAN        none     QNaN rule priority over #D
normal      denormal    #D       only if DAZ=0
denormal    normal      #D       only if DAZ=0
denormal    denormal    #D       only if DAZ=0


Table 6.21: Max exception flags priority 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
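
The NaN and signed-zero rules described above can be sketched as a scalar C function. This is only an illustrative model under stated assumptions: the name fp_max_model is hypothetical (not an Intel API), the +0 preference for mixed-sign zeros is assumed, and DAZ handling and exception flags are not modeled.

```c
#include <math.h>

/* Hypothetical scalar model of the per-element maximum described above.
   Not an Intel API; DAZ and exception flags are intentionally left out. */
double fp_max_model(double a, double b)
{
    if (isnan(a) && isnan(b)) return a;   /* both NaN: first source, unmodified */
    if (isnan(a)) return b;               /* one NaN: the other operand wins */
    if (isnan(b)) return a;
    if (a == 0.0 && b == 0.0)             /* zeros compare equal regardless of sign */
        return signbit(a) ? b : a;        /* assumed: prefer +0 when the signs differ */
    return (a >= b) ? a : b;              /* infinities fall out of the comparison */
}
```

For example, fp_max_model(NAN, 3.0) yields 3.0, and both orderings of (-0.0, +0.0) yield +0.0.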
Operation

FpMax(A,B)
{
    if ((A == -0.0) && (B == +0.0)) return B;
    if ((A == +0.0) && (B == -0.0)) return A;
    if ((A == NaN) && (B == NaN)) return A;
    if (A == NaN) return B;
    if (B == NaN) return A;
    if (A == -inf) return B;
    if (B == -inf) return A;
    if (A == +inf) return A;
    if (B == +inf) return B;
    if (A >= B) return A;
    return B;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = FpMax(zmm2[i+63:i] , tmpSrc3[i+63:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       NO
Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A
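
The three non-reserved encodings in the table above can be sketched in C as follows. This is a rough model only: upconv_load_f64_model is a hypothetical helper name, and the returned byte count is assumed to coincide with the disp8*N column.

```c
#include <stddef.h>

/* Hypothetical model of the Sf64 memory forms listed above.
   sss is the 3-bit S2S1S0 field. Returns the number of bytes read
   (assumed equal to N in the disp8*N column), or 0 for a reserved encoding. */
size_t upconv_load_f64_model(double dst[8], const double *mem, int sss)
{
    switch (sss) {
    case 0:                                   /* {8to8}: plain 64-byte load */
        for (int i = 0; i < 8; i++) dst[i] = mem[i];
        return 64;
    case 1:                                   /* {1to8}: one element replicated 8 times */
        for (int i = 0; i < 8; i++) dst[i] = mem[0];
        return 8;
    case 2:                                   /* {4to8}: four elements replicated twice */
        for (int i = 0; i < 8; i++) dst[i] = mem[i % 4];
        return 32;
    default:                                  /* 011..111 are reserved */
        return 0;
    }
}
```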


Register Swizzle: Sf64

MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_gmax_pd (__m512d, __m512d);
__m512d _mm512_mask_gmax_pd (__m512d, __mmask8, __m512d, __m512d);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 
#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
VGMAXPS - Maximum of Float32 Vectors

Opcode                          Instruction                 Description
MVEX.NDS.512.66.0F38.W0 53 /r   vgmaxps zmm1 {k1}, zmm2,    Determine the maximum of float32 vector zmm2
                                Sf32(zmm3/mt)               and float32 vector Sf32(zmm3/mt) and store
                                                            the result in zmm1, under write-mask.


Description 


Determines the maximum value of each pair of corresponding elements in float32 vec- 
tor zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on 
memory or float32 vector zmm3. The result is written into float32 vector zmm1. 


FpMax() returns the bigger of the two float32 arguments, following IEEE in general. NaN
has special handling: if one source operand is NaN, then the other source operand is
returned (choice made per-component). If both are NaN, then the unchanged NaN from the
first source (here zmm2) is returned. Note that if the first source is an SNaN, it is not
quietized; it is returned without any modification. This differs from the new IEEE
754-08 rules, which state that in the case of an input SNaN, its quietized version should
be returned instead of the other value.


Another new IEEE 754-08 rule is that max(-0,+0) == max(+0,-0) == +0, which honors the 
sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recom- 
mends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the 
result of comparing zeros to be dependent on the order of parameters, using a comparison 
that ignores the signs. 


This instruction treats input denormals as zeros according to the DAZ control bit, but it 
does not flush tiny results to zero. 


The following table describes exception flags priority: 


Input 1     Input 2     Flags   Comments
SNaN        denormal    #I      #I priority over #D
denormal    SNaN        #I      #I priority over #D
QNaN        denormal    none    QNaN rule priority over #D
denormal    QNaN        none    QNaN rule priority over #D
normal      denormal    #D      only if DAZ=0
denormal    normal      #D      only if DAZ=0
denormal    denormal    #D      only if DAZ=0


Table 6.22: Max exception flags priority 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
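
The write-mask behaviour described above can be sketched lane by lane in C. This is a simplified model: masked_max_ps_model is a hypothetical name, and the NaN and signed-zero special cases are deliberately omitted so that only the masking is shown.

```c
#include <stdint.h>

/* Hypothetical lane-wise model of a write-masked float32 maximum.
   Only the masking is modeled; NaN and signed-zero rules are omitted. */
void masked_max_ps_model(float dst[16], uint16_t k1,
                         const float src2[16], const float src3[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            /* mask bit set: compute and store the element */
            dst[n] = (src2[n] >= src3[n]) ? src2[n] : src3[n];
        }
        /* mask bit clear: dst[n] keeps its previous value */
    }
}
```

With k1 = 0x00FF only the low eight elements of dst are overwritten; the upper eight are left untouched.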
Operation

FpMax(A,B)
{
    if ((A == -0.0) && (B == +0.0)) return B;
    if ((A == +0.0) && (B == -0.0)) return A;
    if ((A == NaN) && (B == NaN)) return A;
    if (A == NaN) return B;
    if (B == NaN) return A;
    if (A == -inf) return B;
    if (B == -inf) return A;
    if (A == +inf) return A;
    if (B == +inf) return B;
    if (A >= B) return A;
    return B;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMax(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       NO
Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}
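
The swizzle patterns in the MVEX.EH=0 table can be read as a per-lane permutation. The C sketch below is an illustration under two assumptions that are not spelled out on this page: the pattern string lists the source element for positions d, c, b, a (highest first), and the same permutation is applied independently to each group of four float32 elements. The helper name swizzle_ps_model is hypothetical.

```c
#include <string.h>

/* Hypothetical model of the register swizzle. pattern is one of the
   four-letter strings from the table, e.g. "cdab". dst and src must not
   alias. Returns 0 on success, -1 on a malformed pattern. */
int swizzle_ps_model(float dst[16], const float src[16], const char *pattern)
{
    if (pattern == NULL || strlen(pattern) != 4) return -1;
    for (int pos = 0; pos < 4; pos++) {
        char sel = pattern[3 - pos];          /* pattern[3] feeds element a */
        if (sel < 'a' || sel > 'd') return -1;
        for (int lane = 0; lane < 4; lane++)  /* same permutation in every lane */
            dst[4 * lane + pos] = src[4 * lane + (sel - 'a')];
    }
    return 0;
}
```

Under these assumptions, "dcba" is the identity, "cdab" exchanges a with b and c with d, and "aaaa" replicates the lowest element of each lane.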


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_gmax_ps (__m512, __m512);
__m512 _mm512_mask_gmax_ps (__m512, __mmask16, __m512, __m512);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
VGMINPD - Minimum of Float64 Vectors

Opcode                          Instruction                 Description
MVEX.NDS.512.66.0F38.W1 52 /r   vgminpd zmm1 {k1}, zmm2,    Determine the minimum of float64 vector zmm2
                                Sf64(zmm3/mt)               and float64 vector Sf64(zmm3/mt) and store
                                                            the result in zmm1, under write-mask.


Description 


Determines the minimum value of each pair of corresponding elements in float64 vec- 
tor zmm2 and the float64 vector result of the swizzle/broadcast/conversion process on 
memory or float64 vector zmm3. The result is written into float64 vector zmm1. 


FpMin() returns the smaller of the two float64 arguments, following IEEE in general. NaN
has special handling: if one source operand is NaN, then the other source operand is
returned (choice made per-component). If both are NaN, then the unchanged NaN from the
first source (here zmm2) is returned. Note that if the first source is an SNaN, it is not
quietized; it is returned without any modification. This differs from the new IEEE
754-08 rules, which state that in the case of an input SNaN, its quietized version should
be returned instead of the other value.


Another new IEEE 754-08 rule is that min(-0,+0) == min(+0,-0) == -0, which honors the 
sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recom- 
mends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the 
result of comparing zeros to be dependent on the order of parameters, using a comparison 
that ignores the signs. 


This instruction treats input denormals as zeros according to the DAZ control bit, but it 
does not flush tiny results to zero. 


The following table describes exception flags priority: 


Input 1     Input 2     Flags   Comments
SNaN        denormal    #I      #I priority over #D
denormal    SNaN        #I      #I priority over #D
QNaN        denormal    none    QNaN rule priority over #D
denormal    QNaN        none    QNaN rule priority over #D
normal      denormal    #D      only if DAZ=0
denormal    normal      #D      only if DAZ=0
denormal    denormal    #D      only if DAZ=0


Table 6.23: Min exception flags priority 
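
The priority order in the table above can be sketched as a small C classifier that works on raw float64 bit patterns, so that signaling NaNs are not altered in transit. This is an illustrative model only: the function and flag names are hypothetical, the SNaN-with-QNaN case (not listed in the table) is assumed to raise #I, and the SAE override is not modeled.

```c
#include <stdint.h>

enum { FLAG_NONE = 0, FLAG_INVALID = 1, FLAG_DENORMAL = 2 };  /* #I, #D */

#define EXP_MASK  0x7FF0000000000000ull
#define FRAC_MASK 0x000FFFFFFFFFFFFFull
#define QUIET_BIT 0x0008000000000000ull

static int is_nan_bits(uint64_t x)      { return (x & EXP_MASK) == EXP_MASK && (x & FRAC_MASK) != 0; }
static int is_snan_bits(uint64_t x)     { return is_nan_bits(x) && (x & QUIET_BIT) == 0; }
static int is_denormal_bits(uint64_t x) { return (x & EXP_MASK) == 0 && (x & FRAC_MASK) != 0; }

/* Hypothetical model: which flag a min/max of the two float64 bit patterns
   would raise, following the priority order of the table. */
int minmax_flags_model(uint64_t a, uint64_t b, int daz)
{
    if (is_snan_bits(a) || is_snan_bits(b)) return FLAG_INVALID;   /* #I beats #D */
    if (is_nan_bits(a) || is_nan_bits(b)) return FLAG_NONE;        /* QNaN rule beats #D */
    if (!daz && (is_denormal_bits(a) || is_denormal_bits(b)))
        return FLAG_DENORMAL;                                      /* #D only if DAZ=0 */
    return FLAG_NONE;
}
```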


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
Operation

FpMin(A,B)
{
    if ((A == -0.0) && (B == +0.0)) return A;
    if ((A == +0.0) && (B == -0.0)) return B;
    if ((A == NaN) && (B == NaN)) return A;
    if (A == NaN) return B;
    if (B == NaN) return A;
    if (A == -inf) return A;
    if (B == -inf) return B;
    if (A == +inf) return B;
    if (B == +inf) return A;
    if (A < B) return A;
    return B;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = FpMin(zmm2[i+63:i] , tmpSrc3[i+63:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       NO
Memory Up-conversion: Sf64

S2S1S0   Function:                    Usage                    disp8*N
000      no conversion                [rax] {8to8} or [rax]    64
001      broadcast 1 element (x8)     [rax] {1to8}             8
010      broadcast 4 elements (x2)    [rax] {4to8}             32
011      reserved                     N/A                      N/A
100      reserved                     N/A                      N/A
101      reserved                     N/A                      N/A
110      reserved                     N/A                      N/A
111      reserved                     N/A                      N/A


Register Swizzle: Sf64

MVEX.EH=0
S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_gmin_pd (__m512d, __m512d);
__m512d _mm512_mask_gmin_pd (__m512d, __mmask8, __m512d, __m512d);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 
#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
VGMINPS - Minimum of Float32 Vectors

Opcode                          Instruction                 Description
MVEX.NDS.512.66.0F38.W0 52 /r   vgminps zmm1 {k1}, zmm2,    Determine the minimum of float32 vector zmm2
                                Sf32(zmm3/mt)               and float32 vector Sf32(zmm3/mt) and store
                                                            the result in zmm1, under write-mask.


Description 


Determines the minimum value of each pair of corresponding elements in float32 vec- 
tor zmm2 and the float32 vector result of the swizzle/broadcast/conversion process on 
memory or float32 vector zmm3. The result is written into float32 vector zmm1. 


FpMin() returns the smaller of the two float32 arguments, following IEEE in general. NaN
has special handling: if one source operand is NaN, then the other source operand is
returned (choice made per-component). If both are NaN, then the unchanged NaN from the
first source (here zmm2) is returned. Note that if the first source is an SNaN, it is not
quietized; it is returned without any modification. This differs from the new IEEE
754-08 rules, which state that in the case of an input SNaN, its quietized version should
be returned instead of the other value.


Another new IEEE 754-08 rule is that min(-0,+0) == min(+0,-0) == -0, which honors the 
sign, in contrast to the comparison rules for signed zero (stated above). D3D10.0 recom- 
mends the IEEE 754-08 behavior here, but it will not be enforced; it is permissible for the 
result of comparing zeros to be dependent on the order of parameters, using a comparison 
that ignores the signs. 


This instruction treats input denormals as zeros according to the DAZ control bit, but it 
does not flush tiny results to zero. 


The following table describes exception flags priority: 


Input 1     Input 2     Flags   Comments
SNaN        denormal    #I      #I priority over #D
denormal    SNaN        #I      #I priority over #D
QNaN        denormal    none    QNaN rule priority over #D
denormal    QNaN        none    QNaN rule priority over #D
normal      denormal    #D      only if DAZ=0
denormal    normal      #D      only if DAZ=0
denormal    denormal    #D      only if DAZ=0


Table 6.24: Min exception flags priority 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
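
The per-element minimum described above, including its NaN and signed-zero rules, can be sketched as a scalar C function. This is only a model under stated assumptions: fp_min_model is a hypothetical name, the -0 preference for mixed-sign zeros is assumed, and DAZ and exception flags are not modeled.

```c
#include <math.h>

/* Hypothetical scalar model of the per-element float32 minimum.
   Not an Intel API; DAZ and exception flags are intentionally left out. */
float fp_min_model(float a, float b)
{
    if (isnan(a) && isnan(b)) return a;   /* both NaN: first source, unmodified */
    if (isnan(a)) return b;               /* one NaN: the other operand wins */
    if (isnan(b)) return a;
    if (a == 0.0f && b == 0.0f)           /* zeros compare equal regardless of sign */
        return signbit(a) ? a : b;        /* assumed: prefer -0 when the signs differ */
    return (a < b) ? a : b;               /* infinities fall out of the comparison */
}
```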
Operation

FpMin(A,B)
{
    if ((A == -0.0) && (B == +0.0)) return A;
    if ((A == +0.0) && (B == -0.0)) return B;
    if ((A == NaN) && (B == NaN)) return A;
    if (A == NaN) return B;
    if (B == NaN) return A;
    if (A == -inf) return A;
    if (B == -inf) return B;
    if (A == +inf) return B;
    if (B == +inf) return A;
    if (A < B) return A;
    return B;
}

if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = FpMin(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       NO
Memory Up-conversion: Sf32

S2S1S0   Function:                    Usage                      disp8*N
000      no conversion                [rax] {16to16} or [rax]    64
001      broadcast 1 element (x16)    [rax] {1to16}              4
010      broadcast 4 elements (x4)    [rax] {4to16}              16
011      float16 to float32           [rax] {float16}            32
100      uint8 to float32             [rax] {uint8}              16
110      uint16 to float32            [rax] {uint16}             32
111      sint16 to float32            [rax] {sint16}             32


Register Swizzle: Sf32

MVEX.EH=0
S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}

MVEX.EH=1
S2S1S0   Rounding Mode Override           Usage
1xx      SAE (Suppress-All-Exceptions)    , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_gmin_ps (__m512, __m512);
__m512 _mm512_mask_gmin_ps (__m512, __mmask16, __m512, __m512);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to the data size
                  granularity dictated by SwizzUpConv mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
VLOADUNPACKHD - Load Unaligned High And Unpack To Doubleword Vector


Opcode                    Instruction                 Description
MVEX.512.0F38.W0 D4 /r    vloadunpackhd zmm1 {k1},    Load high 64-byte-aligned portion of unaligned
                          Ui32(mt)                    doubleword stream Ui32(mt - 64), unpack
                                                      mask-enabled elements that fall in that portion,
                                                      and store those elements in doubleword vector
                                                      zmm1, under write-mask.

Description 


The high-64-byte portion of the byte/word/doubleword stream starting at the
element-aligned address (mt - 64) is loaded, converted and expanded into the
write-mask-enabled elements of doubleword vector zmm1. The number of set bits in the
write-mask determines the length of the converted doubleword stream, as each converted
doubleword is mapped to exactly one of the doubleword elements in zmm1, skipping over
write-masked elements of zmm1.


This instruction only transfers those converted doublewords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream doublewords are left unchanged. The vloadunpackld
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.


In conjunction with vloadunpackld, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpackld, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:


// assume memory location is pointed by register rax
vloadunpackld v0 {k1}, [rax]
vloadunpackhd v0 {k1}, [rax+64]


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line
boundary containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in vector mask k1 retain their previous values. However, see above for unusual
aspects of the write-mask's operation with this instruction. 
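
The way the low and high halves cooperate in an unaligned load can be sketched with a small C simulation over a byte buffer. This is a model under stated assumptions, not the hardware definition: both helper names are hypothetical, no up-conversion is applied (4-byte elements), index 0 of the buffer is treated as a 64-byte-aligned address, and the low-half helper is inferred from the description here (it loads elements until the first 64-byte boundary is reached).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the low half: fill enabled elements from the stream
   at byte offset mt until the next 64-byte boundary is reached. */
void loadunpack_lo_model(uint32_t zmm[16], uint16_t k1, const uint8_t *mem, size_t mt)
{
    size_t loadOffset = 0;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            memcpy(&zmm[n], mem + mt + 4 * loadOffset, 4);
            loadOffset++;
            if ((mt + 4 * loadOffset) % 64 == 0) break;   /* rest belongs to the high half */
        }
    }
}

/* Hypothetical model of the high half, called with mt = stream start + 64
   (so mt must be at least 64): skip the elements that sit before the
   boundary, then fill the remaining enabled elements. */
void loadunpack_hi_model(uint32_t zmm[16], uint16_t k1, const uint8_t *mem, size_t mt)
{
    size_t pointer = mt - 64;
    size_t loadOffset = 0;
    int foundBoundary = 0;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            if (!foundBoundary) {
                if ((pointer + (loadOffset + 1) * 4) % 64 == 0) foundBoundary = 1;
            } else {
                memcpy(&zmm[n], mem + pointer + 4 * loadOffset, 4);
            }
            loadOffset++;
        }
    }
}
```

Calling the low helper with offset 72 and the high helper with offset 72+64, both with mask 0xFFFF, fills all sixteen elements from the unaligned stream; when the start offset is already 64-byte aligned, the high helper transfers nothing.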


Operation 


loadOffset = 0
upSize = UpConvLoadSizeOf_i32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if ( ( (pointer + (loadOffset+1)*upSize ) % 64) == 0 ) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            zmm1[i+31:i] = UpConvLoad_i32(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}

Flags Affected 
None. 


Memory Up-conversion: Ui32

S2S1S0   Function:           Usage                      disp8*N
000      no conversion       [rax] {16to16} or [rax]    4
001      reserved            N/A                        N/A
010      reserved            N/A                        N/A
011      reserved            N/A                        N/A
100      uint8 to uint32     [rax] {uint8}              1
101      sint8 to sint32     [rax] {sint8}              1
110      uint16 to uint32    [rax] {uint16}             2
111      sint16 to sint32    [rax] {sint16}             2
Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_extloadunpackhi_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_mask_extloadunpackhi_epi32 (__m512i, __mmask16, void const*, _MM_UPCONV_EPI32_ENUM, int);
__m512i _mm512_loadunpackhi_epi32 (__m512i, void const*);
__m512i _mm512_mask_loadunpackhi_epi32 (__m512i, __mmask16, void const*);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise
                  data granularity dictated by the UpConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the second operand is not a memory location.
VLOADUNPACKHPD - Load Unaligned High And Unpack To Float64 Vector


Opcode                    Instruction                  Description
MVEX.512.0F38.W1 D5 /r    vloadunpackhpd zmm1 {k1},    Load high 64-byte-aligned portion of unaligned
                          Uf64(mt)                     float64 stream Uf64(mt - 64), unpack
                                                       mask-enabled elements that fall in that portion,
                                                       and store those elements in float64 vector
                                                       zmm1, under write-mask.


Description 


The high-64-byte portion of the quadword stream starting at the element-aligned address
(mt - 64) is loaded, converted and expanded into the write-mask-enabled elements of
quadword vector zmm1. The number of set bits in the write-mask determines the length
of the converted quadword stream, as each converted quadword is mapped to exactly one
of the quadword elements in zmm1, skipping over write-masked elements of zmm1.


This instruction only transfers those converted quadwords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream quadwords are left unchanged. The vloadunpacklpd
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.


In conjunction with vloadunpacklpd, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpacklpd, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:


// assume memory location is pointed by register rax
vloadunpacklpd v0 {k1}, [rax]
vloadunpackhpd v0 {k1}, [rax+64]


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 
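
The bounds of the accessed region follow directly from the two expressions above. A minimal C sketch of that arithmetic (the type and function names are hypothetical):

```c
#include <stdint.h>

/* First and last byte address of the 64-byte line that is accessed. */
typedef struct {
    uint64_t first;
    uint64_t last;
} line_range;

/* Hypothetical helper: computes linear_address & ~0x3F and that value + 63. */
line_range accessed_line(uint64_t linear_address)
{
    line_range r;
    r.first = linear_address & ~(uint64_t)0x3F;   /* round down to a 64-byte boundary */
    r.last  = r.first + 63;
    return r;
}
```

For a linear address of 0x1047, this gives the range 0x1040 through 0x107F.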


Note that the address reported by a page fault is the beginning of the 64-byte cache line
boundary containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in vector mask k1 retain their previous values. However, see above for unusual
aspects of the write-mask's operation with this instruction. 
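
When the low and high instructions are issued as a pair with the same mask, their net effect on a packed float64 stream is an expand: consecutive stream elements are placed into the mask-enabled elements in order. The C sketch below models only that combined result, not the split at the 64-byte boundary; expand_pd_model is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical model of the combined low+high result for float64 data:
   consecutive stream elements go to the enabled elements of zmm, in order.
   Returns the number of stream elements consumed. */
int expand_pd_model(double zmm[8], uint8_t k1, const double *stream)
{
    int consumed = 0;
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1u)
            zmm[n] = stream[consumed++];   /* enabled element takes the next stream value */
        /* disabled element: left unchanged and skipped over */
    }
    return consumed;
}
```

With mask 0x5A (bits 1, 3, 4 and 6 set), four stream elements are consumed and land in elements 1, 3, 4 and 6.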


Operation 


loadOffset = 0
upSize = UpConvLoadSizeOf_f64(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if ( ( (pointer + (loadOffset+1)*upSize ) % 64) == 0 ) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            zmm1[i+63:i] = UpConvLoad_f64(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: Uf64

S2S1S0   Function:        Usage                    disp8*N
000      no conversion    [rax] {8to8} or [rax]    8
001      reserved         N/A                      N/A
010      reserved         N/A                      N/A
011      reserved         N/A                      N/A
100      reserved         N/A                      N/A
101      reserved         N/A                      N/A
110      reserved         N/A                      N/A
111      reserved         N/A                      N/A
Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_extloadunpackhi_pd (__m512d, void const*, _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_mask_extloadunpackhi_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, int);
__m512d _mm512_loadunpackhi_pd (__m512d, void const*);
__m512d _mm512_mask_loadunpackhi_pd (__m512d, __mmask8, void const*);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)            If a memory address referencing the SS segment is in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned to element-wise
                  data granularity dictated by the UpConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the second operand is not a memory location.
VLOADUNPACKHPS - Load Unaligned High And Unpack To Float32 Vector


Opcode                    Instruction                  Description
MVEX.512.0F38.W0 D5 /r    vloadunpackhps zmm1 {k1},    Load high 64-byte-aligned portion of unaligned
                          Uf32(mt)                     float32 stream Uf32(mt - 64), unpack
                                                       mask-enabled elements that fall in that portion,
                                                       and store those elements in float32 vector
                                                       zmm1, under write-mask.


Description 


The high-64-byte portion of the byte/word/doubleword stream starting at the
element-aligned address (mt - 64) is loaded, converted and expanded into the
write-mask-enabled elements of doubleword vector zmm1. The number of set bits in the
write-mask determines the length of the converted doubleword stream, as each converted
doubleword is mapped to exactly one of the doubleword elements in zmm1, skipping over
write-masked elements of zmm1.


This instruction only transfers those converted doublewords (if any) in the stream that
occur at or after the first 64-byte-aligned address following (mt - 64) (that is, in the high
cache line of the memory stream for the current implementation). Elements in zmm1
that don't map to those stream doublewords are left unchanged. The vloadunpacklps
instruction is used to load the part of the stream before the first 64-byte-aligned address
preceding mt.


In conjunction with vloadunpacklps, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpacklps, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence
to perform an unaligned vector load would be:


// assume memory location is pointed by register rax
vloadunpacklps v0 {k1}, [rax]
vloadunpackhps v0 {k1}, [rax+64]


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line
boundary containing the memory operand. The instruction will not produce any #GP or
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte
boundary. Additionally, A/D bits in the page table will not be updated.


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in vector mask k1 retain their previous values. However, see above for unusual
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0
upSize = UpConvLoadSizeOf_f32(SSS[2:0])
foundNext64BytesBoundary = false
pointer = mt - 64

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if ( ( (pointer + (loadOffset+1)*upSize ) % 64) == 0 ) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            zmm1[i+31:i] = UpConvLoad_f32(pointer + upSize*loadOffset)
        }
        loadOffset++
    }
}


SIMD Floating-Point Exceptions 


Invalid. 


Memory Up-conversion: Uf32

S2S1S0   Function:             Usage                      disp8*N
000      no conversion         [rax] {16to16} or [rax]    4
001      reserved              N/A                        N/A
010      reserved              N/A                        N/A
011      float16 to float32    [rax] {float16}            2
100      uint8 to float32      [rax] {uint8}              1
101      sint8 to float32      [rax] {sint8}              1
110      uint16 to float32     [rax] {uint16}             2
111      sint16 to float32     [rax] {sint16}             2
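
Because N for this instruction is the size of one element before up-conversion, the disp8*N column can be turned into a small lookup. The C sketch below is illustrative only; both function names are hypothetical, and a return value of 0 is used here to mark a reserved encoding.

```c
/* Hypothetical lookup of N (bytes per element before up-conversion) for the
   Uf32 encodings in the table above. Returns 0 for reserved encodings. */
int upconv_f32_elem_size(unsigned sss)
{
    switch (sss & 7u) {
    case 0: return 4;    /* no conversion: float32 */
    case 3: return 2;    /* float16 */
    case 4: return 1;    /* uint8 */
    case 5: return 1;    /* sint8 */
    case 6: return 2;    /* uint16 */
    case 7: return 2;    /* sint16 */
    default: return 0;   /* 001 and 010 are reserved */
    }
}

/* Hypothetical helper: byte displacement encoded by a compressed 8-bit
   displacement under the disp8*N rule. */
long effective_disp_f32(signed char disp8, unsigned sss)
{
    return (long)disp8 * upconv_f32_elem_size(sss);
}
```

Under this rule, a disp8 of 16 with no conversion and a disp8 of 64 with uint8 up-conversion both encode a 64-byte displacement.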
Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_extloadunpackhi_ps (__m512, void const*, _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_mask_extloadunpackhi_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, int);
__m512 _mm512_loadunpackhi_ps (__m512, void const*);
__m512 _mm512_mask_loadunpackhi_ps (__m512, __mmask16, void const*);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOADUNPACKHQ - Load Unaligned High And Unpack To Int64 Vector 


Opcode                    Instruction                           Description 
MVEX.512.0F38.W1 D4 /r    vloadunpackhq zmm1 {k1}, U_i64(m_t)   Load high 64-byte-aligned portion of unaligned 
                                                                int64 stream U_i64(m_t - 64), unpack mask-enabled 
                                                                elements that fall in that portion, and store 
                                                                those elements in int64 vector zmm1, under 
                                                                write-mask. 


Description 


The high-64-byte portion of the quadword stream starting at the element-aligned address 
(m_t - 64) is loaded, converted and expanded into the write-mask-enabled elements of 
quadword vector zmm1. The number of set bits in the write-mask determines the length 
of the converted quadword stream, as each converted quadword is mapped to exactly one 
of the quadword elements in zmm1, skipping over the elements of zmm1 that are disabled 
by the write-mask. 


This instruction only transfers those converted quadwords (if any) in the stream that 
occur at or after the first 64-byte-aligned address following (m_t - 64) (that is, in the high 
cache line of the memory stream for the current implementation). Elements in zmm1 that 
don't map to those stream quadwords are left unchanged. The vloadunpacklq instruction 
is used to load the part of the stream before the first 64-byte-aligned address preceding 
m_t. 


In conjunction with vloadunpacklq, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpacklq, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence 
to perform an unaligned vector load would be: 


// assume memory location is pointed by register rax 
vloadunpacklq v0 {k1}, [rax] 
vloadunpackhq v0 {k1}, [rax+64] 
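

As an illustration only, the low/high pair can be modeled in software for the no-conversion 
quadword case. The sketch below follows the Operation pseudocode of vloadunpacklq and 
vloadunpackhq. Everything named here (loadunpacklo_q, loadunpackhi_q, load_unaligned_q, 
and the convention that byte offset 0 of the buffer stands for a 64-byte-aligned address) 
is an assumption made for the example and is not part of the instruction set or of the 
compiler intrinsics. 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated memory: byte offset 0 of `mem` is ASSUMED to be 64-byte aligned,
   so an offset plays the role of the linear address m_t. */

/* Hypothetical model of vloadunpacklq (no up-conversion, upSize = 8). */
void loadunpacklo_q(uint64_t zmm[8], uint8_t k1, const uint8_t *mem, size_t mt)
{
    size_t loadOffset = 0;
    assert(mt % 8 == 0);                 /* element-wise alignment is required */
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            memcpy(&zmm[n], mem + mt + 8 * loadOffset, 8);
            loadOffset++;
            if ((mt + 8 * loadOffset) % 64 == 0)
                break;                   /* end of the low cache line reached */
        }
    }
}

/* Hypothetical model of vloadunpackhq; `mt` is the operand (stream start + 64). */
void loadunpackhi_q(uint64_t zmm[8], uint8_t k1, const uint8_t *mem, size_t mt)
{
    size_t loadOffset = 0;
    size_t pointer;
    int foundNext64BytesBoundary = 0;
    assert(mt >= 64 && mt % 8 == 0);
    pointer = mt - 64;
    for (int n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            if (!foundNext64BytesBoundary) {
                /* still in the low line: skip, but watch for the boundary */
                if ((pointer + (loadOffset + 1) * 8) % 64 == 0)
                    foundNext64BytesBoundary = 1;
            } else {
                memcpy(&zmm[n], mem + pointer + 8 * loadOffset, 8);
            }
            loadOffset++;
        }
    }
}

/* Unaligned 8-element load built from the pair, mirroring the sequence above. */
void load_unaligned_q(uint64_t zmm[8], const uint8_t *mem, size_t addr)
{
    loadunpacklo_q(zmm, 0xFF, mem, addr);
    loadunpackhi_q(zmm, 0xFF, mem, addr + 64);
}
```

With a buffer of consecutive quadwords and addr = 24, the low step fills elements 0 to 4 
and the high step fills elements 5 to 7. With a 64-byte-aligned addr the low step fills all 
eight elements and the high step transfers nothing. 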


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 
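

A minimal sketch of the bounds stated above, assuming a 64-bit linear address held in an 
integer. The helper names line_first_byte and line_last_byte are hypothetical and exist 
only for this example. 

```c
#include <stdint.h>

/* First byte of the 64-byte region accessed for a given linear address. */
uint64_t line_first_byte(uint64_t linear_address)
{
    return linear_address & ~UINT64_C(0x3F);
}

/* Last byte of that same 64-byte region. */
uint64_t line_last_byte(uint64_t linear_address)
{
    return (linear_address & ~UINT64_C(0x3F)) + 63;
}
```

For example, a linear address of 0x1234 gives the region 0x1200 through 0x123F. 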


Note that the address reported by a page fault is the beginning of the 64-byte cache line 
containing the memory operand. The instruction will not produce any #GP or 
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte 
boundary. Additionally, A/D bits in the page table will not be updated. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in k1 retain their previous values. However, see above for unusual 
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0 
upSize = UpConvLoadSizeOf_i64(SSS[2:0]) 
foundNext64BytesBoundary = false 
pointer = m_t - 64 


for (n = 0; n < 8; n++) { 
    if (k1[n] != 0) { 
        if (foundNext64BytesBoundary == false) { 
            if (((pointer + (loadOffset+1)*upSize) % 64) == 0) { 
                foundNext64BytesBoundary = true 
            } 
        } else { 
            i = 64*n 
            zmm1[i+63:i] = UpConvLoad_i64(pointer + upSize*loadOffset) 
        } 
        loadOffset++ 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: U_i64 


S2S1S0   Function        Usage                    disp8*N 
000      no conversion   [rax] {8to8} or [rax]    8 
001      reserved        N/A                      N/A 
010      reserved        N/A                      N/A 
011      reserved        N/A                      N/A 
100      reserved        N/A                      N/A 
101      reserved        N/A                      N/A 
110      reserved        N/A                      N/A 
111      reserved        N/A                      N/A 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_extloadunpackhi_epi64 (__m512i, void const*, _MM_UPCONV_EPI64_ENUM, int); 
__m512i _mm512_mask_extloadunpackhi_epi64 (__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, int); 
__m512i _mm512_loadunpackhi_epi64 (__m512i, void const*); 
__m512i _mm512_mask_loadunpackhi_epi64 (__m512i, __mmask8, void const*); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOADUNPACKLD - Load Unaligned Low And Unpack To Doubleword Vector 


Opcode                    Instruction                           Description 
MVEX.512.0F38.W0 D0 /r    vloadunpackld zmm1 {k1}, U_i32(m_t)   Load low 64-byte-aligned portion of unaligned 
                                                                doubleword stream U_i32(m_t), unpack mask-enabled 
                                                                elements that fall in that portion, and 
                                                                store those elements in doubleword vector 
                                                                zmm1, under write-mask. 


Description 


The low-64-byte portion of the byte/word/doubleword stream starting at the element-aligned 
address m_t is loaded, converted and expanded into the write-mask-enabled elements 
of doubleword vector zmm1. The number of set bits in the write-mask determines 
the length of the converted doubleword stream, as each converted doubleword is 
mapped to exactly one of the doubleword elements in zmm1, skipping over the elements 
of zmm1 that are disabled by the write-mask. 


This instruction only transfers those converted doublewords (if any) in the stream that 
occur before the first 64-byte-aligned address following m_t (that is, in the low cache line of 
the memory stream in the current implementation). Elements in zmm1 that don't map to 
those converted stream doublewords are left unchanged. The vloadunpackhd instruction 
is used to load the part of the stream at or after the first 64-byte-aligned address preceding 
m_t. 


In conjunction with vloadunpackhd, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpackhd, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence 
to perform an unaligned vector load would be: 


// assume memory location is pointed by register rax 
vloadunpackld v0 {k1}, [rax] 
vloadunpackhd v0 {k1}, [rax+64] 


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line 
containing the memory operand. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in k1 retain their previous values. However, see above for unusual 
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0 
upSize = UpConvLoadSizeOf_i32(SSS[2:0]) 


for (n = 0; n < 16; n++) { 
    i = 32*n 
    if (k1[n] != 0) { 
        zmm1[i+31:i] = UpConvLoad_i32(m_t + upSize*loadOffset) 
        loadOffset++ 
        if (((m_t + upSize*loadOffset) % 64) == 0) { 
            break 
        } 
    } 
} 
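

The interaction between a sparse write-mask, an up-conversion and the 64-byte boundary 
may be easier to see in a small software model. The sketch below mirrors the pseudocode 
above for the {uint8} conversion only (upSize = 1, zero-extension to 32 bits). The function 
name loadunpacklo_d_uint8, its return value, and the convention that byte offset 0 of the 
buffer stands for a 64-byte-aligned address are assumptions made for the example. 

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vloadunpackld with the {uint8} up-conversion.
   `mem` offset 0 is ASSUMED 64-byte aligned; `mt` is the stream start.
   Returns the number of stream elements consumed (illustration only;
   the instruction itself produces no such value). */
size_t loadunpacklo_d_uint8(uint32_t zmm[16], uint16_t k1,
                            const uint8_t *mem, size_t mt)
{
    const size_t upSize = 1;             /* size of a uint8 source element */
    size_t loadOffset = 0;
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            zmm[n] = mem[mt + upSize * loadOffset];   /* zero-extend uint8 */
            loadOffset++;
            if ((mt + upSize * loadOffset) % 64 == 0)
                break;                   /* rest of the stream is in the high line */
        }
    }
    return loadOffset;
}
```

With mask 0x00A5 and a stream starting at offset 62, only elements 0 and 2 are written: the 
second stream byte is the last one in the low cache line, so elements 5 and 7 keep their 
previous values and would be filled by the matching high instruction. 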


Flags Affected 


None. 


Memory Up-conversion: U_i32 


S2S1S0   Function           Usage                      disp8*N 
000      no conversion      [rax] {16to16} or [rax]    4 
001      reserved           N/A                        N/A 
010      reserved           N/A                        N/A 
011      reserved           N/A                        N/A 
100      uint8 to uint32    [rax] {uint8}              1 
101      sint8 to sint32    [rax] {sint8}              1 
110      uint16 to uint32   [rax] {uint16}             2 
111      sint16 to sint32   [rax] {sint16}             2 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_extloadunpacklo_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM, int); 
__m512i _mm512_mask_extloadunpacklo_epi32 (__m512i, __mmask16, void const*, _MM_UPCONV_EPI32_ENUM, int); 
__m512i _mm512_loadunpacklo_epi32 (__m512i, void const*); 
__m512i _mm512_mask_loadunpacklo_epi32 (__m512i, __mmask16, void const*); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOADUNPACKLPD - Load Unaligned Low And Unpack To Float64 Vector 


Opcode                    Instruction                            Description 
MVEX.512.0F38.W1 D1 /r    vloadunpacklpd zmm1 {k1}, U_f64(m_t)   Load low 64-byte-aligned portion of unaligned 
                                                                 float64 stream U_f64(m_t), unpack mask-enabled 
                                                                 elements that fall in that portion, and store 
                                                                 those elements in float64 vector zmm1, under 
                                                                 write-mask. 


Description 


The low-64-byte portion of the quadword stream starting at the element-aligned address 
m_t is loaded, converted and expanded into the write-mask-enabled elements of quadword 
vector zmm1. The number of set bits in the write-mask determines the length of the 
converted quadword stream, as each converted quadword is mapped to exactly one of the 
quadword elements in zmm1, skipping over the elements of zmm1 that are disabled by 
the write-mask. 


This instruction only transfers those converted quadwords (if any) in the stream that 
occur before the first 64-byte-aligned address following m_t (that is, in the low cache line of 
the memory stream in the current implementation). Elements in zmm1 that don't map to 
those converted stream quadwords are left unchanged. The vloadunpackhpd instruction is 
used to load the part of the stream at or after the first 64-byte-aligned address preceding 
m_t. 


In conjunction with vloadunpackhpd, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpackhpd, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence 
to perform an unaligned vector load would be: 


// assume memory location is pointed by register rax 
vloadunpacklpd v0 {k1}, [rax] 
vloadunpackhpd v0 {k1}, [rax+64] 


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line 
containing the memory operand. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in k1 retain their previous values. However, see above for unusual 
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0 
upSize = UpConvLoadSizeOf_f64(SSS[2:0]) 


for (n = 0; n < 8; n++) { 
    i = 64*n 
    if (k1[n] != 0) { 
        zmm1[i+63:i] = UpConvLoad_f64(m_t + upSize*loadOffset) 
        loadOffset++ 
        if (((m_t + upSize*loadOffset) % 64) == 0) { 
            break 
        } 
    } 
} 


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: U_f64 


S2S1S0   Function        Usage                    disp8*N 
000      no conversion   [rax] {8to8} or [rax]    8 
001      reserved        N/A                      N/A 
010      reserved        N/A                      N/A 
011      reserved        N/A                      N/A 
100      reserved        N/A                      N/A 
101      reserved        N/A                      N/A 
110      reserved        N/A                      N/A 
111      reserved        N/A                      N/A 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512d _mm512_extloadunpacklo_pd (__m512d, void const*, _MM_UPCONV_PD_ENUM, int); 
__m512d _mm512_mask_extloadunpacklo_pd (__m512d, __mmask8, void const*, _MM_UPCONV_PD_ENUM, int); 
__m512d _mm512_loadunpacklo_pd (__m512d, void const*); 
__m512d _mm512_mask_loadunpacklo_pd (__m512d, __mmask8, void const*); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD              Instruction not available in these modes 


Protected and Compatibility Mode 


#UD              Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOADUNPACKLPS - Load Unaligned Low And Unpack To Float32 Vector 


Opcode                    Instruction                            Description 
MVEX.512.0F38.W0 D1 /r    vloadunpacklps zmm1 {k1}, U_f32(m_t)   Load low 64-byte-aligned portion of unaligned 
                                                                 float32 stream U_f32(m_t), unpack mask-enabled 
                                                                 elements that fall in that portion, and store 
                                                                 those elements in float32 vector zmm1, under 
                                                                 write-mask. 


Description 


The low-64-byte portion of the byte/word/doubleword stream starting at the element-aligned 
address m_t is loaded, converted and expanded into the write-mask-enabled elements 
of doubleword vector zmm1. The number of set bits in the write-mask determines 
the length of the converted doubleword stream, as each converted doubleword is 
mapped to exactly one of the doubleword elements in zmm1, skipping over the elements 
of zmm1 that are disabled by the write-mask. 


This instruction only transfers those converted doublewords (if any) in the stream that 
occur before the first 64-byte-aligned address following m_t (that is, in the low cache line of 
the memory stream in the current implementation). Elements in zmm1 that don't map to 
those converted stream doublewords are left unchanged. The vloadunpackhps instruction 
is used to load the part of the stream at or after the first 64-byte-aligned address preceding 
m_t. 


In conjunction with vloadunpackhps, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpackhps, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFFFF or no write-mask for this purpose. The typical instruction sequence 
to perform an unaligned vector load would be: 


// assume memory location is pointed by register rax 
vloadunpacklps v0 {k1}, [rax] 
vloadunpackhps v0 {k1}, [rax+64] 


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 
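

As a rough illustration of the disp8*N rule, the byte displacement can be computed by 
multiplying the signed 8-bit displacement field by N. The sketch below takes N from the 
disp8*N column of this instruction's Memory Up-conversion table (4 for no conversion, 2 for 
float16, uint16 and sint16, 1 for uint8 and sint8). The names LOADUNPACK_F32_N and 
scaled_disp8 are hypothetical and the function is an assumption about how the rule might 
be applied in a decoder model, not a description of the hardware. 

```c
#include <stdint.h>

/* N per S2S1S0 encoding for the U_f32 load-unpack conversions.
   A value of 0 marks a reserved encoding. */
static const int LOADUNPACK_F32_N[8] = { 4, 0, 0, 2, 1, 1, 2, 2 };

/* Computes the byte displacement encoded by an 8-bit displacement field
   under disp8*N. Returns 1 on success, 0 for a reserved encoding
   (in which case *disp is left untouched). */
int scaled_disp8(int8_t disp8, unsigned sss, int32_t *disp)
{
    int n = LOADUNPACK_F32_N[sss & 7];
    if (n == 0)
        return 0;
    *disp = (int32_t)disp8 * n;
    return 1;
}
```

Under these assumptions a displacement field of 16 addresses byte offset 64 with no 
conversion, 32 with {float16} and 16 with {uint8}. 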


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line 
containing the memory operand. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in k1 retain their previous values. However, see above for unusual 
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0 
upSize = UpConvLoadSizeOf_f32(SSS[2:0]) 


for (n = 0; n < 16; n++) { 
    i = 32*n 
    if (k1[n] != 0) { 
        zmm1[i+31:i] = UpConvLoad_f32(m_t + upSize*loadOffset) 
        loadOffset++ 
        if (((m_t + upSize*loadOffset) % 64) == 0) { 
            break 
        } 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid. 


Memory Up-conversion: U_f32 


S2S1S0   Function             Usage                      disp8*N 
000      no conversion        [rax] {16to16} or [rax]    4 
001      reserved             N/A                        N/A 
010      reserved             N/A                        N/A 
011      float16 to float32   [rax] {float16}            2 
100      uint8 to float32     [rax] {uint8}              1 
101      sint8 to float32     [rax] {sint8}              1 
110      uint16 to float32    [rax] {uint16}             2 
111      sint16 to float32    [rax] {sint16}             2 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 _mm512_extloadunpacklo_ps (__m512, void const*, _MM_UPCONV_PS_ENUM, int); 
__m512 _mm512_mask_extloadunpacklo_ps (__m512, __mmask16, void const*, _MM_UPCONV_PS_ENUM, int); 
__m512 _mm512_loadunpacklo_ps (__m512, void const*); 
__m512 _mm512_mask_loadunpacklo_ps (__m512, __mmask16, void const*); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD              Instruction not available in these modes 


Protected and Compatibility Mode 


#UD              Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOADUNPACKLQ - Load Unaligned Low And Unpack To Int64 Vector 


Opcode                    Instruction                           Description 
MVEX.512.0F38.W1 D0 /r    vloadunpacklq zmm1 {k1}, U_i64(m_t)   Load low 64-byte-aligned portion of unaligned 
                                                                int64 stream U_i64(m_t), unpack mask-enabled 
                                                                elements that fall in that portion, and store 
                                                                those elements in int64 vector zmm1, under 
                                                                write-mask. 


Description 


The low-64-byte portion of the quadword stream starting at the element-aligned address 
m_t is loaded, converted and expanded into the write-mask-enabled elements of quadword 
vector zmm1. The number of set bits in the write-mask determines the length of the 
converted quadword stream, as each converted quadword is mapped to exactly one of the 
quadword elements in zmm1, skipping over the elements of zmm1 that are disabled by 
the write-mask. 


This instruction only transfers those converted quadwords (if any) in the stream that 
occur before the first 64-byte-aligned address following m_t (that is, in the low cache line of 
the memory stream in the current implementation). Elements in zmm1 that don't map to 
those converted stream quadwords are left unchanged. The vloadunpackhq instruction is 
used to load the part of the stream at or after the first 64-byte-aligned address preceding 
m_t. 


In conjunction with vloadunpackhq, this instruction is useful for re-expanding data that 
was packed into a queue. Also in conjunction with vloadunpackhq, it allows unaligned 
vector loads (that is, vector loads that are only element-wise, not vector-wise, aligned); 
use a mask of 0xFF or no write-mask for this purpose. The typical instruction sequence 
to perform an unaligned vector load would be: 


// assume memory location is pointed by register rax 
vloadunpacklq v0 {k1}, [rax] 
vloadunpackhq v0 {k1}, [rax+64] 


This instruction does not have broadcast support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note that this instruction will always access 64 bytes of memory. The memory region 
accessed will always be between linear_address & (~0x3F) and (linear_address & (~0x3F)) 
+ 63 boundaries. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line 
containing the memory operand. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are modified in zmm1. Elements in zmm1 with the corresponding 
bit clear in k1 retain their previous values. However, see above for unusual 
aspects of the write-mask's operation with this instruction. 


Operation 


loadOffset = 0 
upSize = UpConvLoadSizeOf_i64(SSS[2:0]) 


for (n = 0; n < 8; n++) { 
    i = 64*n 
    if (k1[n] != 0) { 
        zmm1[i+63:i] = UpConvLoad_i64(m_t + upSize*loadOffset) 
        loadOffset++ 
        if (((m_t + upSize*loadOffset) % 64) == 0) { 
            break 
        } 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: U_i64 


S2S1S0   Function        Usage                    disp8*N 
000      no conversion   [rax] {8to8} or [rax]    8 
001      reserved        N/A                      N/A 
010      reserved        N/A                      N/A 
011      reserved        N/A                      N/A 
100      reserved        N/A                      N/A 
101      reserved        N/A                      N/A 
110      reserved        N/A                      N/A 
111      reserved        N/A                      N/A 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512i _mm512_extloadunpacklo_epi64 (__m512i, void const*, _MM_UPCONV_EPI64_ENUM, int); 
__m512i _mm512_mask_extloadunpacklo_epi64 (__m512i, __mmask8, void const*, _MM_UPCONV_EPI64_ENUM, int); 
__m512i _mm512_loadunpacklo_epi64 (__m512i, void const*); 
__m512i _mm512_mask_loadunpacklo_epi64 (__m512i, __mmask8, void const*); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD              Instruction not available in these modes 


Protected and Compatibility Mode 


#UD              Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to element-wise data granularity dictated by the UpConv. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If processor model does not implement the specific instruction. 
                 If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 If the second operand is not a memory location. 


VLOG2PS - Vector Logarithm Base-2 of Float32 Vector 


Opcode                       Instruction                    Description 
MVEX.512.66.0F38.W0 C9 /r    vlog2ps zmm1 {k1}, zmm2/m_t    Calculate logarithm from float32 vector 
                                                            zmm2/m_t and store the result in zmm1, under 
                                                            write-mask. 


Description 


Computes the element-by-element logarithm base-2 of the float32 vector on memory or 
float32 vector zmm2. The result is written into float32 vector zmm1. 


1. 4ulp of relative error when the source value is within the intervals (0, 0.5) or (2, ∞] 
2. absolute error less than 2^-21 within the interval [0.5, 2] 


For an input value of +/-0 the instruction returns -∞ and sets the Divide-By-Zero 
flag (#Z). Negative numbers (including -∞) should return the canonical NaN and set the 
Invalid flag (#I). Note however that this instruction treats input denormals as zeros of 
the same sign, so for denormal negative inputs it returns -∞ and sets the Divide-By-Zero 
status flag. If any source element is NaN, the quietized NaN source value is returned for 
that element (and #I is raised for input sNaNs). 


Current implementation of this instruction does not support any SwizzUpConv setting 
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result 
in an Invalid Opcode exception. 


The vlog2_DX() function follows Table 6.25 when dealing with floating-point special numbers. 


Input   Result       Comments 
NaN     input qNaN   Raise #I flag if sNaN 
+∞      +∞ 
+0      -∞           Raise #Z flag 
-0      -∞           Raise #Z flag 
<0      NaN          Raise #I flag 
-∞      NaN          Raise #I flag 
2^n     n            Exact integral result 


Table 6.25: vlog2_DX() special floating-point values behavior 
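

The rows of Table 6.25, together with the denormals-as-zeros rule from the Description, 
can be sketched as a classification over raw float32 bit patterns. This is an illustrative 
model only: the names vlog2_special, FLAG_INVALID and FLAG_ZERO_DIVIDE are hypothetical, 
the bit pattern chosen for the canonical NaN (0xFFC00000) is an assumption, and ordinary 
inputs that are not exact powers of two are deliberately left to a real log2 approximation. 

```c
#include <stdint.h>
#include <string.h>

enum { FLAG_NONE = 0, FLAG_INVALID = 1, FLAG_ZERO_DIVIDE = 2 };  /* #I, #Z */

#define F32_POS_INF UINT32_C(0x7F800000)
#define F32_NEG_INF UINT32_C(0xFF800000)
#define F32_QNAN    UINT32_C(0xFFC00000)  /* ASSUMED canonical NaN encoding */

/* Returns 1 and fills *result / *flags when `in` matches a row of the table;
   returns 0 for ordinary inputs, which need an actual log2 approximation. */
int vlog2_special(uint32_t in, uint32_t *result, unsigned *flags)
{
    uint32_t sign = in >> 31;
    uint32_t exp  = (in >> 23) & 0xFF;
    uint32_t frac = in & 0x7FFFFF;

    *flags = FLAG_NONE;
    if (exp == 0xFF && frac != 0) {          /* NaN: return the quietized input */
        if ((frac & 0x400000) == 0)
            *flags = FLAG_INVALID;           /* input was an sNaN */
        *result = in | 0x400000;
        return 1;
    }
    if (exp == 0) {                          /* +/-0, or a denormal treated as zero */
        *result = F32_NEG_INF;
        *flags = FLAG_ZERO_DIVIDE;
        return 1;
    }
    if (sign) {                              /* negative values, including -inf */
        *result = F32_QNAN;
        *flags = FLAG_INVALID;
        return 1;
    }
    if (exp == 0xFF) {                       /* +inf */
        *result = F32_POS_INF;
        return 1;
    }
    if (frac == 0) {                         /* 2^n: exact integral result n */
        float n = (float)((int)exp - 127);
        memcpy(result, &n, sizeof n);
        return 1;
    }
    return 0;
}
```

Note the ordering: the zero/denormal test comes before the sign test, which is what makes a 
negative denormal produce -∞ with #Z rather than NaN with #I. 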


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


tmpSrc2[511:0] = zmm2/m_t 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
} 
for (n = 0; n < 16; n++) { 
    if (k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = vlog2_DX(tmpSrc2[i+31:i]) 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Zero. 


Denormal Handling 


Treat Input Denormals As Zeros : YES 
Flush Tiny Results To Zero :     YES 


Register Swizzle 


MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      reserved                N/A 
010      reserved                N/A 
011      reserved                N/A 
100      reserved                N/A 
101      reserved                N/A 
110      reserved                N/A 
111      reserved                N/A 

MVEX.EH=1 
S2S1S0   Rounding Mode Override          Usage 
1xx      SAE (Suppress-All-Exceptions)   , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 _mm512_log2_ps (__m512); 
__m512 _mm512_mask_log2_ps (__m512, __mmask16, __m512); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                 This instruction does not support any 
                 SwizzUpConv different from the default value (no broadcast, 
                 no conversion). If SwizzUpConv function is set to any value 
                 different than "no action", then an Invalid Opcode fault is 
                 raised. This includes register swizzles. 


VMOVAPD - Move Aligned Float64 Vector 


Opcode                     Instruction                       Description 
MVEX.512.66.0F.W1 28 /r    vmovapd zmm1 {k1}, U_f64(m_t)     Move float64 vector U_f64(m_t) into vector 
                                                             zmm1, under write-mask. 
MVEX.512.66.0F.W1 28 /r    vmovapd zmm1 {k1}, S_f64(zmm2)    Move float64 vector S_f64(zmm2) into vector 
                                                             zmm1, under write-mask. 
MVEX.512.66.0F.W1 29 /r    vmovapd m_t {k1}, D_f64(zmm1)     Move float64 vector D_f64(zmm1) into m_t, 
                                                             under write-mask. 


Description 


Moves float64 vector result of the swizzle/broadcast/conversion process on memory or 
float64 vector zmm2 into float64 vector zmm1 or down-converts and stores float64 vector 
zmm2 into destination memory. 


This instruction is write-masked, so only those elements with the corresponding bit(s) set 
in the vector mask (k1) register are computed and stored into register/memory. Elements 
in register/memory with the corresponding bit(s) clear in the vector mask register are 
maintained with the previous value. 
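

The register swizzle applied to a register source can be sketched in software from the 
usage strings in this instruction's Register Swizzle table. The sketch assumes that the 
letters a, b, c, d name the four 64-bit elements of each 256-bit group from least to most 
significant, and that a usage string such as {cdab} lists the source element for each 
destination position from most to least significant, so that {dcba} is the identity. The 
names SWIZZLE_PATTERN and swizzle_f64 are hypothetical. 

```c
#include <stdint.h>

/* Usage strings from the Register Swizzle table, indexed by S2S1S0. */
static const char *const SWIZZLE_PATTERN[8] = {
    "dcba", "cdab", "badc", "dacb", "aaaa", "bbbb", "cccc", "dddd"
};

/* Applies a 4 x 64-bit swizzle to the raw bits of an 8-element vector.
   `dst` and `src` must not overlap. */
void swizzle_f64(uint64_t dst[8], const uint64_t src[8], unsigned sss)
{
    const char *pattern = SWIZZLE_PATTERN[sss & 7];
    for (int group = 0; group < 2; group++) {        /* two groups of four */
        for (int pos = 0; pos < 4; pos++) {
            /* rightmost letter feeds position 0, leftmost feeds position 3 */
            int from = pattern[3 - pos] - 'a';
            dst[4 * group + pos] = src[4 * group + from];
        }
    }
}
```

Under these assumptions, applying {cdab} to elements 0..7 yields 1,0,3,2,5,4,7,6, and 
{bbbb} yields 1,1,1,1,5,5,5,5. 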


Operation 


DESTINATION IS A VECTOR OPERAND 
if(source is a register operand) { 
    if(MVEX.EH==1) { 
        tmpSrc2[511:0] = zmm2[511:0] 
    } else { 
        tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2) 
    } 
} else { 
    tmpSrc2[511:0] = UpConvLoad_f64(m_t) 
} 
for (n = 0; n < 8; n++) { 
    if (k1[n] != 0) { 
        i = 64*n 
        zmm1[i+63:i] = tmpSrc2[i+63:i] 
    } 
} 


DESTINATION IS A MEMORY OPERAND 
downSize = DownConvStoreSizeOf_f64(SSS[2:0]) 
for (n = 0; n < 8; n++) { 
    if (k1[n] != 0) { 
        i = 64*n 
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0]) 
        if(downSize == 8) { 
            MemStore(m_t+8*n) = tmp[63:0] 
        } 
    } 
} 

SIMD Floating-Point Exceptions 


DESTINATION IS A VECTOR OPERAND: None. 
DESTINATION IS A MEMORY OPERAND: None. 


Memory Up-conversion: U_f64 


S2S1S0   Function        Usage                    disp8*N 
000      no conversion   [rax] {8to8} or [rax]    64 
001      reserved        N/A                      N/A 
010      reserved        N/A                      N/A 
011      reserved        N/A                      N/A 
100      reserved        N/A                      N/A 
101      reserved        N/A                      N/A 
110      reserved        N/A                      N/A 
111      reserved        N/A                      N/A 


Register Swizzle: S_f64 


MVEX.EH=0 
S2S1S0   Function: 4 x 64 bits    Usage 
000      no swizzle               zmm0 or zmm0 {dcba} 
001      swap (inner) pairs       zmm0 {cdab} 
010      swap with two-away       zmm0 {badc} 
011      cross-product swizzle    zmm0 {dacb} 
100      broadcast a element      zmm0 {aaaa} 
101      broadcast b element      zmm0 {bbbb} 
110      broadcast c element      zmm0 {cccc} 
111      broadcast d element      zmm0 {dddd} 


Memory Down-conversion: D_f64 


S2S1S0   Function        Usage   disp8*N 
000      no conversion   zmm1    64 
001      reserved        N/A     N/A 
010      reserved        N/A     N/A 
011      reserved        N/A     N/A 
100      reserved        N/A     N/A 
101      reserved        N/A     N/A 
110      reserved        N/A     N/A 
111      reserved        N/A     N/A 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512d _mm512_mask_mov_pd (__m512d, __mmask8, __m512d); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD              Instruction not available in these modes 


Protected and Compatibility Mode 


#UD              Instruction not available in these modes 


64 bit Mode 


#SS(0)           If a memory address referencing the SS segment is 
                 in a non-canonical form. 
#GP(0)           If the memory address is in a non-canonical form. 
                 If a memory operand linear address is not aligned 
                 to the data size granularity dictated by SwizzUpConv 
                 mode. 
#PF(fault-code)  For a page fault. 
#NM              If CR0.TS[bit 3]=1. 
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VMOVAPS - Move Aligned Float32 Vector 


Opcode                  Instruction                       Description 
MVEX.512.0F.W0 28 /r    vmovaps zmm1 {k1}, U_f32(m_t)     Move float32 vector U_f32(m_t) into vector 
                                                          zmm1, under write-mask. 
MVEX.512.0F.W0 28 /r    vmovaps zmm1 {k1}, S_f32(zmm2)    Move float32 vector S_f32(zmm2) into vector 
                                                          zmm1, under write-mask. 
MVEX.512.0F.W0 29 /r    vmovaps m_t {k1}, D_f32(zmm1)     Move float32 vector D_f32(zmm1) into m_t, 
                                                          under write-mask. 


Description 


Moves float32 vector result of the swizzle/broadcast/conversion process on memory or 
float32 vector zmm2 into float32 vector zmm1 or down-converts and stores float32 vector 
zmm2 into destination memory. 


This instruction is write-masked, so only those elements with the corresponding bit(s) set 
in the vector mask (k1) register are computed and stored into register/memory. Elements 
in register/memory with the corresponding bit(s) clear in the vector mask register are 
maintained with the previous value. 


Operation 


DESTINATION IS A VECTOR OPERAND
if(source is a register operand) {
    if(MVEX.EH==1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_f32(mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[i+31:i]
    }
}


DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f32(SSS[2:0])

for(n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if(downSize == 4) {
            MemStore(mt+4*n) = tmp[31:0]
        } else if(downSize == 2) {
            MemStore(mt+2*n) = tmp[15:0]
        } else if(downSize == 1) {
            MemStore(mt+n) = tmp[7:0]
        }
    }
}
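
The write-mask behaviour of the register-destination path can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions, not the hardware behaviour in full: vectors are plain lists of 16 values, the mask is an integer whose bit n stands for k1[n], and the name masked_move is hypothetical.

```python
def masked_move(dst, src, k1, lanes=16):
    """Return a copy of dst where every lane whose k1 bit is set comes from src."""
    out = list(dst)
    for n in range(lanes):
        if (k1 >> n) & 1:  # corresponds to k1[n] != 0
            out[n] = src[n]
    return out


old = [0.0] * 16
new = [float(n) for n in range(16)]
result = masked_move(old, new, 0x000F)  # only lanes 0..3 are enabled
```

Lanes whose mask bit is clear keep the value they had in dst, which is the "maintained with the previous value" rule from the description.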


SIMD Floating-Point Exceptions 


DESTINATION IS A VECTOR OPERAND: Invalid. 


DESTINATION IS A MEMORY OPERAND: Overflow, Underflow, Invalid, Precision, Denormal.


Memory Up-conversion: Uf32


S2S1S0  Function:           Usage                    disp8*N
000     no conversion       [rax] {16to16} or [rax]  64
001     reserved            N/A                      N/A
010     reserved            N/A                      N/A
011     float16 to float32  [rax] {float16}          32
100     uint8 to float32    [rax] {uint8}            16
101     sint8 to float32    [rax] {sint8}            16
110     uint16 to float32   [rax] {uint16}           32
111     sint16 to float32   [rax] {sint16}           32
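
The relation between the SSS encoding, the element size read from memory, and the disp8*N column can be sketched in Python as follows. The names UPCONV_F32, upconv_load_f32 and upconv_bytes_f32 are hypothetical, memory is assumed to be a little-endian bytes object, and the results are Python floats rather than true float32 values.

```python
import struct

# struct format and source element size in bytes for each non-reserved SSS value.
UPCONV_F32 = {
    0b000: ("<f", 4),  # no conversion
    0b011: ("<e", 2),  # float16 to float32
    0b100: ("<B", 1),  # uint8 to float32
    0b101: ("<b", 1),  # sint8 to float32
    0b110: ("<H", 2),  # uint16 to float32
    0b111: ("<h", 2),  # sint16 to float32
}


def upconv_load_f32(data, sss, lanes=16):
    """Decode `lanes` source elements from `data` and widen each to a float."""
    if sss not in UPCONV_F32:
        raise ValueError("reserved SSS encoding")
    fmt, size = UPCONV_F32[sss]
    return [float(struct.unpack_from(fmt, data, n * size)[0]) for n in range(lanes)]


def upconv_bytes_f32(sss, lanes=16):
    """Total bytes read from memory for one vector; matches the disp8*N column."""
    if sss not in UPCONV_F32:
        raise ValueError("reserved SSS encoding")
    return UPCONV_F32[sss][1] * lanes
```

For example, a {uint8} load reads 16 bytes and a {float16} load reads 32 bytes, which is why the compressed displacement scale N differs between rows.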


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits  Usage
000     no swizzle             zmm0 or zmm0 {dcba}
001     swap (inner) pairs     zmm0 {cdab}
010     swap with two-away     zmm0 {badc}
011     cross-product swizzle  zmm0 {dacb}
100     broadcast a element    zmm0 {aaaa}
101     broadcast b element    zmm0 {bbbb}
110     broadcast c element    zmm0 {cccc}
111     broadcast d element    zmm0 {dddd}
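
The swizzle patterns in the table can be illustrated with a small Python model. It assumes that each pattern string is written most-significant element first (so the right-most letter names the source of element 0, with a being the lowest element of a group) and that the same pattern is applied to every group of four elements; both the assumption and the names SWIZZLES and swizzle are illustrative, not taken from this section.

```python
# SSS encoding -> pattern string, as listed in the table above.
SWIZZLES = {
    0b000: "dcba",  # no swizzle
    0b001: "cdab",  # swap (inner) pairs
    0b010: "badc",  # swap with two-away
    0b011: "dacb",  # cross-product swizzle
    0b100: "aaaa",  # broadcast a element
    0b101: "bbbb",  # broadcast b element
    0b110: "cccc",  # broadcast c element
    0b111: "dddd",  # broadcast d element
}


def swizzle(vec, sss):
    """Apply the 4-element swizzle selected by sss to every group of four elements."""
    pattern = SWIZZLES[sss]
    out = []
    for base in range(0, len(vec), 4):
        group = vec[base:base + 4]  # group[0] is 'a', group[3] is 'd'
        for pos in range(4):
            letter = pattern[3 - pos]  # right-most letter feeds element 0
            out.append(group["abcd".index(letter)])
    return out
```

With vec = [0, 1, ..., 15], the {cdab} pattern turns the first group [0, 1, 2, 3] into [1, 0, 3, 2] under the stated convention.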
Memory Down-conversion: Df32


S2S1S0  Function:           Usage            disp8*N
000     no conversion       zmm1             64
001     reserved            N/A              N/A
010     reserved            N/A              N/A
011     float32 to float16  zmm1 {float16}   32
100     float32 to uint8    zmm1 {uint8}     16
101     float32 to sint8    zmm1 {sint8}     16
110     float32 to uint16   zmm1 {uint16}    32
111     float32 to sint16   zmm1 {sint16}    32
Intel® C/C++ Compiler Intrinsic Equivalent
__m512 _mm512_mask_mov_ps (__m512, __mmask16, __m512);
Exceptions
Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes
Protected and Compatibility Mode
#UD              Instruction not available in these modes
64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMOVDQA32 - Move Aligned Int32 Vector


Opcode                   Instruction                      Description
MVEX.512.66.0F.W0 6F /r  vmovdqa32 zmm1 {k1}, Ui32(mt)    Move int32 vector Ui32(mt) into vector zmm1, under write-mask.
MVEX.512.66.0F.W0 6F /r  vmovdqa32 zmm1 {k1}, Si32(zmm2)  Move int32 vector Si32(zmm2) into vector zmm1, under write-mask.
MVEX.512.66.0F.W0 7F /r  vmovdqa32 mt {k1}, Di32(zmm1)    Move int32 vector Di32(zmm1) into mt, under write-mask.


Description 


Moves int32 vector result of the swizzle/broadcast/conversion process on memory or
int32 vector zmm2 into int32 vector zmm1 or down-converts and stores int32 vector
zmm2 into destination memory.


This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.


Operation


DESTINATION IS A VECTOR OPERAND
if(source is a register operand) {
    if(MVEX.EH==1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_i32(mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmpSrc2[i+31:i]
    }
}


DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_i32(SSS[2:0])

for(n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
        if(downSize == 4) {
            MemStore(mt+4*n) = tmp[31:0]
        } else if(downSize == 2) {
            MemStore(mt+2*n) = tmp[15:0]
        } else if(downSize == 1) {
            MemStore(mt+n) = tmp[7:0]
        }
    }
}


Flags Affected 


DESTINATION IS A VECTOR OPERAND: None. 
DESTINATION IS A MEMORY OPERAND: None. 


Memory Up-conversion: Ui32


S2S1S0  Function:         Usage                    disp8*N
000     no conversion     [rax] {16to16} or [rax]  64
001     reserved          N/A                      N/A
010     reserved          N/A                      N/A
011     reserved          N/A                      N/A
100     uint8 to uint32   [rax] {uint8}            16
101     sint8 to sint32   [rax] {sint8}            16
110     uint16 to uint32  [rax] {uint16}           32
111     sint16 to sint32  [rax] {sint16}           32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits  Usage
000     no swizzle             zmm0 or zmm0 {dcba}
001     swap (inner) pairs     zmm0 {cdab}
010     swap with two-away     zmm0 {badc}
011     cross-product swizzle  zmm0 {dacb}
100     broadcast a element    zmm0 {aaaa}
101     broadcast b element    zmm0 {bbbb}
110     broadcast c element    zmm0 {cccc}
111     broadcast d element    zmm0 {dddd}


Memory Down-conversion: Di32


S2S1S0  Function:         Usage           disp8*N
000     no conversion     zmm1            64
001     reserved          N/A             N/A
010     reserved          N/A             N/A
011     reserved          N/A             N/A
100     uint32 to uint8   zmm1 {uint8}    16
101     sint32 to sint8   zmm1 {sint8}    16
110     uint32 to uint16  zmm1 {uint16}   32
111     sint32 to sint16  zmm1 {sint16}   32


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_mask_mov_epi32 (__m512i, __mmask16, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMOVDQA64 - Move Aligned Int64 Vector


Opcode                   Instruction                      Description
MVEX.512.66.0F.W1 6F /r  vmovdqa64 zmm1 {k1}, Ui64(mt)    Move int64 vector Ui64(mt) into vector zmm1, under write-mask.
MVEX.512.66.0F.W1 6F /r  vmovdqa64 zmm1 {k1}, Si64(zmm2)  Move int64 vector Si64(zmm2) into vector zmm1, under write-mask.
MVEX.512.66.0F.W1 7F /r  vmovdqa64 mt {k1}, Di64(zmm1)    Move int64 vector Di64(zmm1) into mt, under write-mask.


Description 


Moves int64 vector result of the swizzle/broadcast/conversion process on memory or
int64 vector zmm2 into int64 vector zmm1 or down-converts and stores int64 vector
zmm2 into destination memory.


This instruction is write-masked, so only those elements with the corresponding bit(s) set
in the vector mask (k1) register are computed and stored into register/memory. Elements
in register/memory with the corresponding bit(s) clear in the vector mask register are
maintained with the previous value.


Operation


DESTINATION IS A VECTOR OPERAND
if(source is a register operand) {
    if(MVEX.EH==1) {
        tmpSrc2[511:0] = zmm2[511:0]
    } else {
        tmpSrc2[511:0] = SwizzUpConvLoad_i64(zmm2)
    }
} else {
    tmpSrc2[511:0] = UpConvLoad_i64(mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[i+63:i]
    }
}


DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_i64(SSS[2:0])

for(n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
        if(downSize == 8) {
            MemStore(mt+8*n) = tmp[63:0]
        }
    }
}


Flags Affected 


DESTINATION IS A VECTOR OPERAND: None. 
DESTINATION IS A MEMORY OPERAND: None. 


Memory Up-conversion: Ui64


S2S1S0  Function:      Usage                  disp8*N
000     no conversion  [rax] {8to8} or [rax]  64
001     reserved       N/A                    N/A
010     reserved       N/A                    N/A
011     reserved       N/A                    N/A
100     reserved       N/A                    N/A
101     reserved       N/A                    N/A
110     reserved       N/A                    N/A
111     reserved       N/A                    N/A


Register Swizzle: Si64


MVEX.EH=0

S2S1S0  Function: 4 x 64 bits  Usage
000     no swizzle             zmm0 or zmm0 {dcba}
001     swap (inner) pairs     zmm0 {cdab}
010     swap with two-away     zmm0 {badc}
011     cross-product swizzle  zmm0 {dacb}
100     broadcast a element    zmm0 {aaaa}
101     broadcast b element    zmm0 {bbbb}
110     broadcast c element    zmm0 {cccc}
111     broadcast d element    zmm0 {dddd}


Memory Down-conversion: Di64


S2S1S0  Function:      Usage  disp8*N
000     no conversion  zmm1   64
001     reserved       N/A    N/A
010     reserved       N/A    N/A
011     reserved       N/A    N/A
100     reserved       N/A    N/A
101     reserved       N/A    N/A
110     reserved       N/A    N/A
111     reserved       N/A    N/A


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_mask_mov_epi64 (__m512i, __mmask8, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMOVNRAPD - Store Aligned Float64 Vector With No-Read Hint


Opcode                       Instruction                   Description
MVEX.512.F3.0F.W1.EH0 29 /r  vmovnrapd m {k1}, Df64(zmm1)  Store with No-Read hint float64 vector Df64(zmm1) into m, under write-mask.
Description 


Stores float64 vector zmm1 (or a down-converted version of it) into destination memory 
with a No-Read hint for the case the whole vector is going to be written into memory. This 
instruction is intended to speed up the case of stores in streaming kernels where we want 
to avoid wasting memory bandwidth by being forced to read the original content of entire 
cache lines from memory when we overwrite their whole contents completely. 


In Knights Corner, this instruction is able to optimize memory bandwidth in case of a cache
miss and avoid reading the original contents of the memory destination operand if the
following conditions hold true:


• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).
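
The two conditions can be written as a small predicate over the MVEX fields. This is a minimal sketch for illustration only: the function name no_read_hint_applies is hypothetical, and a given processor may still ignore the hint even when the predicate is true.

```python
def no_read_hint_applies(aaa, sss):
    """True when both documented Knights Corner conditions hold.

    aaa: 3-bit MVEX.aaa field (000 means no write-mask is used)
    sss: 3-bit MVEX.SSS field (000 means no down-conversion)
    """
    return aaa == 0b000 and sss == 0b000
```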


Note that this instruction is encoded by forcing MVEX.EH bit to 0. The Eviction Hint does 
not have any effect on this instruction. 


The No-Read directive is intended as a performance hint and could be ignored by a given 
processor implementation. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are stored to memory. Elements in the destination memory 
vector with the corresponding bit clear in k1 register retain their previous value. 


Operation


DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f64(SSS[2:0])

for(n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if(downSize == 8) {
            MemStore(mt+8*n) = tmp[63:0]
        }
    }
}


SIMD Floating-Point Exceptions


None. 


Memory Down-conversion: Df64


S2S1S0  Function:      Usage  disp8*N
000     no conversion  zmm1   64
001     reserved       N/A    N/A
010     reserved       N/A    N/A
011     reserved       N/A    N/A
100     reserved       N/A    N/A
101     reserved       N/A    N/A
110     reserved       N/A    N/A
111     reserved       N/A    N/A


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_storenr_pd(void*, __m512d);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMOVNRAPS - Store Aligned Float32 Vector With No-Read Hint


Opcode                       Instruction                   Description
MVEX.512.F2.0F.W0.EH0 29 /r  vmovnraps m {k1}, Df32(zmm1)  Store with No-Read hint float32 vector Df32(zmm1) into m, under write-mask.
Description 


Stores float32 vector zmm1 (or a down-converted version of it) into destination memory 
with a No-Read hint for the case the whole vector is going to be written into memory. This 
instruction is intended to speed up the case of stores in streaming kernels where we want 
to avoid wasting memory bandwidth by being forced to read the original content of entire 
cache lines from memory when we overwrite their whole contents completely. 


In Knights Corner, this instruction is able to optimize memory bandwidth in case of a cache
miss and avoid reading the original contents of the memory destination operand if the
following conditions hold true:


• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).


Note that this instruction is encoded by forcing MVEX.EH bit to 0. The Eviction Hint does 
not have any effect on this instruction. 


The No-Read directive is intended as a performance hint and could be ignored by a given 
processor implementation. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are stored to memory. Elements in the destination memory 
vector with the corresponding bit clear in k1 register retain their previous value. 


Operation


DESTINATION IS A MEMORY OPERAND
downSize = DownConvStoreSizeOf_f32(SSS[2:0])

for(n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if(downSize == 4) {
            MemStore(mt+4*n) = tmp[31:0]
        } else if(downSize == 2) {
            MemStore(mt+2*n) = tmp[15:0]
        } else if(downSize == 1) {
            MemStore(mt+n) = tmp[7:0]
        }
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Memory Down-conversion: Df32


S2S1S0  Function:           Usage            disp8*N
000     no conversion       zmm1             64
001     reserved            N/A              N/A
010     reserved            N/A              N/A
011     float32 to float16  zmm1 {float16}   32
100     float32 to uint8    zmm1 {uint8}     16
101     float32 to sint8    zmm1 {sint8}     16
110     float32 to uint16   zmm1 {uint16}    32
111     float32 to sint16   zmm1 {sint16}    32


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_storenr_ps(void*, __m512);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMOVNRNGOAPD - Non-globally Ordered Store Aligned Float64 Vector
With No-Read Hint


Opcode                       Instruction                      Description
MVEX.512.F3.0F.W1.EH1 29 /r  vmovnrngoapd m {k1}, Df64(zmm1)  Non-ordered Store with No-Read hint float64 vector Df64(zmm1) into m, under write-mask.
Description 


Stores float64 vector zmm1 (or a down-converted version of it) into destination memory
with a No-Read hint for the case the whole vector is going to be written into memory,
using a weakly-ordered memory consistency model (i.e., stores performed with this
instruction are not globally ordered, and subsequent stores from the same thread can be
observed before them).


This instruction is intended to speed up the case of stores in streaming kernels where we
want to avoid wasting memory bandwidth by being forced to read the original content of
entire cache lines from memory when we overwrite their whole contents completely. This
instruction takes advantage of the weakly-ordered memory consistency model to increase
the throughput at which this type of write operation can be performed. For the same
reason, a fencing operation implemented with SFENCE, MFENCE or CPUID instructions
should be used in conjunction with this instruction if multiple threads are reading/writing
the memory operand location (note that Knights Corner implements neither SFENCE nor
MFENCE).


In Knights Corner, this instruction is able to optimize memory bandwidth in case of a cache
miss and avoid reading the original contents of the memory destination operand if the
following conditions hold true:


• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).


Note that this instruction is encoded by forcing MVEX.EH bit to 1. The Eviction Hint does 
not have any effect on this instruction. 


The No-Read directive is intended as a performance hint and could be ignored by a given 
processor implementation. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are stored to memory. Elements in the destination memory 
vector with the corresponding bit clear in k1 register retain their previous value. 


Operation


DESTINATION IS A MEMORY OPERAND 


downSize = DownConvStoreSizeOf_f64(SSS[2:0])

for(n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if(downSize == 8) {
            MemStore(mt+8*n) = tmp[63:0]
        }
    }
}


SIMD Floating-Point Exceptions 


None. 


Memory Down-conversion: Df64


S2S1S0  Function:      Usage  disp8*N
000     no conversion  zmm1   64
001     reserved       N/A    N/A
010     reserved       N/A    N/A
011     reserved       N/A    N/A
100     reserved       N/A    N/A
101     reserved       N/A    N/A
110     reserved       N/A    N/A
111     reserved       N/A    N/A


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_storenrngo_pd(void*, __m512d);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.




VMOVNRNGOAPS - Non-globally Ordered Store Aligned Float32 Vector 
With No-Read Hint 


Opcode                       Instruction                      Description
MVEX.512.F2.0F.W0.EH1 29 /r  vmovnrngoaps m {k1}, Df32(zmm1)  Non-ordered Store with No-Read hint float32 vector Df32(zmm1) into m, under write-mask.
Description 


Stores float32 vector zmm1 (or a down-converted version of it) into destination memory
with a No-Read hint for the case the whole vector is going to be written into memory,
using a weakly-ordered memory consistency model (i.e., stores performed with this
instruction are not globally ordered, and subsequent stores from the same thread can be
observed before them).


This instruction is intended to speed up the case of stores in streaming kernels where we
want to avoid wasting memory bandwidth by being forced to read the original content of
entire cache lines from memory when we overwrite their whole contents completely. This
instruction takes advantage of the weakly-ordered memory consistency model to increase
the throughput at which this type of write operation can be performed. For the same
reason, a fencing operation implemented with SFENCE, MFENCE or CPUID instructions
should be used in conjunction with this instruction if multiple threads are reading/writing
the memory operand location (note that Knights Corner implements neither SFENCE nor
MFENCE).


In Knights Corner, this instruction is able to optimize memory bandwidth in case of a cache
miss and avoid reading the original contents of the memory destination operand if the
following conditions hold true:


• The instruction does not use a write-mask (MVEX.aaa=000).
• The instruction does not perform any kind of down-conversion (MVEX.SSS=000).


Note that this instruction is encoded by forcing MVEX.EH bit to 1. The Eviction Hint does 
not have any effect on this instruction. 


The No-Read directive is intended as a performance hint and could be ignored by a given 
processor implementation. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are stored to memory. Elements in the destination memory 
vector with the corresponding bit clear in k1 register retain their previous value. 


Operation
DESTINATION IS A MEMORY OPERAND 


downSize = DownConvStoreSizeOf_f32(SSS[2:0])

for(n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if(downSize == 4) {
            MemStore(mt+4*n) = tmp[31:0]
        } else if(downSize == 2) {
            MemStore(mt+2*n) = tmp[15:0]
        } else if(downSize == 1) {
            MemStore(mt+n) = tmp[7:0]
        }
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Memory Down-conversion: Df32


S2S1S0  Function:           Usage            disp8*N
000     no conversion       zmm1             64
001     reserved            N/A              N/A
010     reserved            N/A              N/A
011     float32 to float16  zmm1 {float16}   32
100     float32 to uint8    zmm1 {uint8}     16
101     float32 to sint8    zmm1 {sint8}     16
110     float32 to uint16   zmm1 {uint16}    32
111     float32 to sint16   zmm1 {sint16}    32


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_storenrngo_ps(void*, __m512);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMULPD - Multiply Float64 Vectors


Opcode                         Instruction                              Description
MVEX.NDS.512.66.0F.W1 59 /r    vmulpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)    Multiply float64 vector zmm2 and float64 vector Sf64(zmm3/mt) and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float64 vector zmm2 and the 
float64 vector result of the swizzle/broadcast/conversion process on memory or float64 
vector zmm3. The result is written into float64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] * tmpSrc3[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: Sf64


S2S1S0  Function:                  Usage                  disp8*N
000     no conversion              [rax] {8to8} or [rax]  64
001     broadcast 1 element (x8)   [rax] {1to8}           8
010     broadcast 4 elements (x2)  [rax] {4to8}           32
011     reserved                   N/A                    N/A
100     reserved                   N/A                    N/A
101     reserved                   N/A                    N/A
110     reserved                   N/A                    N/A
111     reserved                   N/A                    N/A
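
The broadcast encodings in this table can be sketched as follows. This is a simplified illustration, with memory represented as a Python list of float64 values starting at the operand address; the function name broadcast_load_f64 is hypothetical.

```python
def broadcast_load_f64(mem, sss):
    """Build the 8-element float64 source vector selected by the SSS encoding."""
    if sss == 0b000:  # {8to8}: read 8 elements (64 bytes)
        return list(mem[:8])
    if sss == 0b001:  # {1to8}: read 1 element (8 bytes), replicate it 8 times
        return [mem[0]] * 8
    if sss == 0b010:  # {4to8}: read 4 elements (32 bytes), replicate them twice
        return list(mem[:4]) * 2
    raise ValueError("reserved SSS encoding")
```

The number of bytes actually read (64, 8 or 32) is the value shown in the disp8*N column.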


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0  Function: 4 x 64 bits  Usage
000     no swizzle             zmm0 or zmm0 {dcba}
001     swap (inner) pairs     zmm0 {cdab}
010     swap with two-away     zmm0 {badc}
011     cross-product swizzle  zmm0 {dacb}
100     broadcast a element    zmm0 {aaaa}
101     broadcast b element    zmm0 {bbbb}
110     broadcast c element    zmm0 {cccc}
111     broadcast d element    zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override            Usage
000     Round To Nearest (even)           , {rn}
001     Round Down (-INF)                 , {rd}
010     Round Up (+INF)                   , {ru}
011     Round Toward Zero                 , {rz}
100     Round To Nearest (even) with SAE  , {rn-sae}
101     Round Down (-INF) with SAE        , {rd-sae}
110     Round Up (+INF) with SAE          , {ru-sae}
111     Round Toward Zero with SAE        , {rz-sae}
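
The MVEX.EH=1 table follows directly from the Operation pseudocode: SSS[1:0] selects the rounding mode and SSS[2] enables SAE. A minimal Python sketch of that decoding, with the hypothetical names ROUNDING, decode_rounding_override and usage_suffix:

```python
ROUNDING = ("rn", "rd", "ru", "rz")  # indexed by SSS[1:0]


def decode_rounding_override(sss):
    """Split a 3-bit SSS value into (rounding mode, SAE flag)."""
    mode = ROUNDING[sss & 0b11]  # RoundingMode = SSS[1:0]
    sae = bool(sss & 0b100)      # SSS[2] == 1 suppresses exception flags
    return mode, sae


def usage_suffix(sss):
    """Render the assembler suffix shown in the Usage column."""
    mode, sae = decode_rounding_override(sss)
    return "{" + mode + ("-sae" if sae else "") + "}"
```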


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_mul_pd (__m512d, __m512d);
__m512d _mm512_mask_mul_pd (__m512d, __mmask8, __m512d, __m512d);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VMULPS - Multiply Float32 Vectors


Opcode                      Instruction                              Description
MVEX.NDS.512.0F.W0 59 /r    vmulps zmm1 {k1}, zmm2, Sf32(zmm3/mt)    Multiply float32 vector zmm2 and float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element multiplication between float32 vector zmm2 and the 
float32 vector result of the swizzle/broadcast/conversion process on memory or float32 
vector zmm3. The result is written into float32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i]
    }
}
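
The element loop above can be modelled in Python as a rough sketch. It makes several simplifying assumptions: only round-to-nearest is modelled, DAZ/FZ and exception flags are ignored, inputs are assumed to already be float32 values, and a result too large for float32 raises OverflowError instead of producing infinity. The names to_f32 and masked_mul_ps are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def masked_mul_ps(zmm1, zmm2, src3, k1):
    """Multiply lanes of zmm2 and src3 where k1 is set; other lanes keep zmm1."""
    out = list(zmm1)
    for n in range(16):
        if (k1 >> n) & 1:
            # The product of two float32 values is exact in a double, so a
            # single rounding step to float32 gives the round-to-nearest result.
            out[n] = to_f32(zmm2[n] * src3[n])
    return out
```

Calling masked_mul_ps(old, a, b, 0x00FF) updates lanes 0 to 7 only and leaves lanes 8 to 15 equal to old.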


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero :
(MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf32


S2S1S0  Function:                  Usage                    disp8*N
000     no conversion              [rax] {16to16} or [rax]  64
001     broadcast 1 element (x16)  [rax] {1to16}            4
010     broadcast 4 elements (x4)  [rax] {4to16}            16
011     float16 to float32         [rax] {float16}          32
100     uint8 to float32           [rax] {uint8}            16
110     uint16 to float32          [rax] {uint16}           32
111     sint16 to float32          [rax] {sint16}           32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0  Function: 4 x 32 bits  Usage
000     no swizzle             zmm0 or zmm0 {dcba}
001     swap (inner) pairs     zmm0 {cdab}
010     swap with two-away     zmm0 {badc}
011     cross-product swizzle  zmm0 {dacb}
100     broadcast a element    zmm0 {aaaa}
101     broadcast b element    zmm0 {bbbb}
110     broadcast c element    zmm0 {cccc}
111     broadcast d element    zmm0 {dddd}

MVEX.EH=1

S2S1S0  Rounding Mode Override            Usage
000     Round To Nearest (even)           , {rn}
001     Round Down (-INF)                 , {rd}
010     Round Up (+INF)                   , {ru}
011     Round Toward Zero                 , {rz}
100     Round To Nearest (even) with SAE  , {rn-sae}
101     Round Down (-INF) with SAE        , {rd-sae}
110     Round Up (+INF) with SAE          , {ru-sae}
111     Round Toward Zero with SAE        , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_mul_ps (__m512, __m512);
__m512 _mm512_mask_mul_ps (__m512, __mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
Opcode                      Instruction                       Description
MVEX.512.66.0F38.W0 D4 /r   vpackstorehd mt {k1}, Di32(zmm1)  Pack mask-enabled elements of int32 vector zmm1 to form an
                                                              unaligned int32 stream, down-convert it and logically map
                                                              the stream starting at mt - 64, and store that portion of
                                                              the stream that maps to the high 64-byte-aligned portion
                                                              of the memory destination, under write-mask.


Description 


Packs and down-converts the mask-enabled elements of int32 vector zmm1 into a
byte/word/doubleword stream logically mapped starting at element-aligned address
(mt - 64), and stores the high-64-byte elements of that stream (those elements of the
stream that map at or after the first 64-byte-aligned address following (mt - 64), the
high cache line in the current implementation). The length of the stream depends on the
number of enabled masks, as elements disabled by the mask are not added to the stream.


The vpackstoreld instruction is used to store the part of the stream before the first
64-byte-aligned address preceding mt.


The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason,
the notation and encoding are the same as for a write-mask.


In conjunction with vpackstoreld, this instruction is useful for packing data into a queue.
Also in conjunction with vpackstoreld, it allows unaligned vector stores (that is, vector
stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFFFF or
no write-mask for this purpose. The typical instruction sequence to perform an unaligned
vector store would be:


// assume memory location is pointed by register rax
vpackstoreld [rax] {k1}, v0
vpackstorehd [rax+64] {k1}, v0


This instruction does not have subset support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note that the address reported by a page fault is the beggining of the 64-byte cache line 
boundary containing the memory operand. The instruction will not produce any #GP or 
#SS fault due to address canonicity nor #PF fault if the address is aligned to a 64-byte 
boundary. Additionally, A/D bits in the page table will not be updated. 
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_i32(SSS[2:0])
foundNext64BytesBoundary = false

pointer = mt - 64
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
            if (downSize == 4) {
                MemStore(pointer + storeOffset*4) = tmp[31:0]
            } else if (downSize == 2) {
                MemStore(pointer + storeOffset*2) = tmp[15:0]
            } else if (downSize == 1) {
                MemStore(pointer + storeOffset) = tmp[7:0]
            }
        }
        storeOffset++
    }
}
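
As an illustration only, the operation above can be transcribed into portable C for the no-conversion case (SSS = 000). The name model_packstorehd and its parameters are hypothetical and not part of any Intel API; the sketch models neither faults nor down-conversion, and it assumes the caller owns the memory between mt - 64 and mt + 64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of vpackstorehd without down-conversion.
 * mt   : address given to the instruction (normally the low address + 64)
 * k1   : 16-bit element selector
 * zmm1 : the 16 source int32 elements
 * Returns the number of elements written to memory. */
static int model_packstorehd(void *mt, uint16_t k1, const int32_t zmm1[16])
{
    const uintptr_t down_size = 4;            /* bytes per stored element */
    uintptr_t pointer = (uintptr_t)mt - 64;   /* logical start of the stream */
    int store_offset = 0;                     /* index within the packed stream */
    int found_boundary = 0;
    int written = 0;

    assert(pointer % down_size == 0);         /* element-wise alignment rule */

    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1) {
            if (!found_boundary) {
                /* This element belongs to the low line and is skipped here.
                 * The flag flips once the next element would start on a
                 * 64-byte boundary. */
                if ((pointer + (uintptr_t)(store_offset + 1) * down_size) % 64 == 0)
                    found_boundary = 1;
            } else {
                memcpy((void *)(pointer + (uintptr_t)store_offset * down_size),
                       &zmm1[n], sizeof zmm1[n]);
                written++;
            }
            store_offset++;
        }
    }
    return written;
}
```

With a 64-byte-aligned int32 buffer buf and a low address of &buf[13], passing (char *)&buf[13] + 64 with a mask of 0xFFFF skips the three elements that fit in the low line and writes elements 3 through 15 starting at buf[16]. When the low address is itself 64-byte aligned, the model writes nothing.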


Flags Affected 


None. 


Memory Down-conversion: Di32


S2S1S0   Function:           Usage            disp8*N
000      no conversion       zmm1             4
001      reserved            N/A              N/A
010      reserved            N/A              N/A
011      reserved            N/A              N/A
100      uint32 to uint8     zmm1 {uint8}     1
101      sint32 to sint8     zmm1 {sint8}     1
110      uint32 to uint16    zmm1 {uint16}    2
111      sint32 to sint16    zmm1 {sint16}    2
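
The disp8*N column can be read as a lookup from the SSS field to the scale applied to an 8-bit displacement. The following is a minimal sketch of that reading; the function names are hypothetical, and returning 0 for reserved encodings is an assumption made for illustration (the table only marks them N/A).

```c
#include <assert.h>
#include <stdint.h>

/* Element size N in bytes after Di32 store down-conversion.
 * Returns 0 for the reserved encodings 001, 010 and 011 (illustrative choice). */
static int di32_store_element_size(unsigned sss)
{
    assert(sss < 8);
    switch (sss) {
    case 0: return 4;   /* no conversion    */
    case 4: return 1;   /* uint32 to uint8  */
    case 5: return 1;   /* sint32 to sint8  */
    case 6: return 2;   /* uint32 to uint16 */
    case 7: return 2;   /* sint32 to sint16 */
    default: return 0;  /* reserved         */
    }
}

/* Byte displacement obtained by scaling a signed disp8 by N. */
static int64_t scaled_displacement(int8_t disp8, unsigned sss)
{
    int n = di32_store_element_size(sss);
    assert(n != 0);     /* reserved encodings have no defined scale */
    return (int64_t)disp8 * n;
}
```

Under this reading, a disp8 of 16 with no conversion reaches the +64 byte offset used by the high instruction in the unaligned store sequence.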
Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorehi_epi32 (void*, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_mask_extpackstorehi_epi32 (void*, __mmask16, __m512i,
                                       _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_packstorehi_epi32 (void*, __m512i);
void _mm512_mask_packstorehi_epi32 (void*, __mmask16, __m512i);


Exceptions 
Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 
Protected and Compatibility Mode 
#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 


If a memory operand linear address is not aligned 
to element-wise data granularity dictated by the DownConv. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1.
#UD If processor model does not implement the specific instruction.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If the first operand is not a memory location.
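
The element-wise alignment condition listed under #GP(0) amounts to a divisibility check on the linear address. A minimal sketch, assuming N is the element size in bytes after down-conversion; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when addr satisfies the element-wise alignment rule
 * for an element size of n bytes after down-conversion. */
static int is_element_aligned(uint64_t addr, unsigned n)
{
    assert(n == 1 || n == 2 || n == 4 || n == 8);
    return (addr % n) == 0;
}
```

For example, address 0x1002 passes for a uint16 down-conversion (N = 2) but not for an unconverted int32 store (N = 4).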
VPACKSTOREHPD - Pack And Store Unaligned High From Float64 Vector


Opcode                     Instruction                         Description
MVEX.512.66.0F38.W1 D5 /r  vpackstorehpd mt {k1}, Df64(zmm1)   Pack mask-enabled elements of float64 vector
                                                               zmm1 to form an unaligned float64 stream,
                                                               down-convert it and logically map the stream
                                                               starting at mt - 64, and store that portion
                                                               of the stream that maps to the high
                                                               64-byte-aligned portion of the memory
                                                               destination, under write-mask.


Description 


Packs and down-converts the mask-enabled elements of float64 vector zmm1 into a
float64 stream logically mapped starting at element-aligned address (mt - 64), and stores
the high-64-byte elements of that stream (those elements of the stream that map at or
after the first 64-byte-aligned address following (mt - 64), the high cache line in the
current implementation). The length of the stream depends on the number of enabled mask
bits, as elements disabled by the mask are not added to the stream.

The vpackstorelpd instruction is used to store the part of the stream before the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.

In conjunction with vpackstorelpd, this instruction is useful for packing data into a queue.
Also in conjunction with vpackstorelpd, it allows unaligned vector stores (that is, vector
stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFF or
no write-mask for this purpose. The typical instruction sequence to perform an unaligned
vector store would be:

// assume the memory location is pointed to by register rax
vpackstorelpd [rax] {k1}, v0
vpackstorehpd [rax+64] {k1}, v0
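
The queue-packing use can be sketched in scalar C. The loop below models only the combined architectural effect of a vpackstorelpd/vpackstorehpd pair issued at the queue tail: selected elements are appended contiguously and the tail advances by the number of selected elements. The helper name, the queue layout and the index-based tail are assumptions made for illustration; capacity checks are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: append the mask-selected doubles of zmm1 to queue,
 * starting at element index tail. Returns the new tail index. */
static size_t queue_append_pd(double *queue, size_t tail, uint8_t k1,
                              const double zmm1[8])
{
    assert(queue != NULL && zmm1 != NULL);
    for (int n = 0; n < 8; n++) {
        if ((k1 >> n) & 1)
            queue[tail++] = zmm1[n];   /* disabled elements leave no gap */
    }
    return tail;
}
```

In a vectorized version the tail advance would typically be the population count of the mask, since the pair itself does not update the address register.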


This instruction does not have subset support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or #SS fault due
to address canonicity, nor any #PF fault, if the address is aligned to a 64-byte boundary.
Additionally, A/D bits in the page table will not be updated.




Operation 


storeOffset = 0
downSize = DownConvStoreSizeOf_f64(SSS[2:0])
foundNext64BytesBoundary = false

pointer = mt - 64
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
            if (downSize == 8) {
                MemStore(pointer + storeOffset*8) = tmp[63:0]
            }
        }
        storeOffset++
    }
}


SIMD Floating-Point Exceptions 


None. 


Memory Down-conversion: Df64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    zmm1     8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A
Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorehi_pd (void*, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_mask_extpackstorehi_pd (void*, __mmask8, __m512d,
                                    _MM_DOWNCONV_PD_ENUM, int);
void _mm512_packstorehi_pd (void*, __m512d);
void _mm512_mask_packstorehi_pd (void*, __mmask8, __m512d);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 


If a memory operand linear address is not aligned 
to element-wise data granularity dictated by the DownConv. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1.
#UD If processor model does not implement the specific instruction.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If the first operand is not a memory location.
VPACKSTOREHPS - Pack And Store Unaligned High From Float32 Vector


Opcode                     Instruction                         Description
MVEX.512.66.0F38.W0 D5 /r  vpackstorehps mt {k1}, Df32(zmm1)   Pack mask-enabled elements of float32 vector
                                                               zmm1 to form an unaligned float32 stream,
                                                               down-convert it and logically map the stream
                                                               starting at mt - 64, and store that portion
                                                               of the stream that maps to the high
                                                               64-byte-aligned portion of the memory
                                                               destination, under write-mask.


Description 


Packs and down-converts the mask-enabled elements of float32 vector zmm1 into a
byte/word/doubleword stream logically mapped starting at element-aligned address
(mt - 64), and stores the high-64-byte elements of that stream (those elements of the
stream that map at or after the first 64-byte-aligned address following (mt - 64), the
high cache line in the current implementation). The length of the stream depends on the
number of enabled mask bits, as elements disabled by the mask are not added to the stream.

The vpackstorelps instruction is used to store the part of the stream before the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason,
the notation and encoding are the same as for a write-mask.
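
The selector rule can be sketched as follows. This assumes that "encoding 0" refers to the value of the 3-bit mask field in the instruction encoding; the array k, standing in for the mask register file, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Selector actually applied: mask field 0 means "no write-mask",
 * which selects all 16 elements. */
static uint16_t element_selector(const uint16_t k[8], unsigned enc)
{
    assert(enc < 8);
    return (enc == 0) ? (uint16_t)0xFFFF : k[enc];
}

/* Number of elements added to the stream: one per enabled mask bit. */
static int stream_length(uint16_t selector)
{
    int count = 0;
    while (selector) {
        count += selector & 1;
        selector >>= 1;
    }
    return count;
}
```

A selector of 0x00F0, for instance, yields a four-element stream built from elements 4 through 7, packed with no gaps.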


In conjunction with vpackstorelps, this instruction is useful for packing data into a queue.
Also in conjunction with vpackstorelps, it allows unaligned vector stores (that is, vector
stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFFFF or
no write-mask for this purpose. The typical instruction sequence to perform an unaligned
vector store would be:


// assume the memory location is pointed to by register rax
vpackstorelps [rax] {k1}, v0
vpackstorehps [rax+64] {k1}, v0


This instruction does not have subset support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or #SS fault due
to address canonicity, nor any #PF fault, if the address is aligned to a 64-byte boundary.
Additionally, A/D bits in the page table will not be updated.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_f32(SSS[2:0])
foundNext64BytesBoundary = false

pointer = mt - 64
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 32*n
            tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
            if (downSize == 4) {
                MemStore(pointer + storeOffset*4) = tmp[31:0]
            } else if (downSize == 2) {
                MemStore(pointer + storeOffset*2) = tmp[15:0]
            } else if (downSize == 1) {
                MemStore(pointer + storeOffset) = tmp[7:0]
            }
        }
        storeOffset++
    }
}
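
How a stream divides between the low and high instructions follows from the boundary test in the operation above. The sketch below computes that split for a given low address, selector and down-converted element size; the type and function names are hypothetical, and the arithmetic is a reading of the pseudocode rather than a statement about the hardware implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int low;    /* elements stored by the low instruction  */
    int high;   /* elements stored by the high instruction */
} stream_split;

static int popcount16(uint16_t m)
{
    int count = 0;
    while (m) {
        count += m & 1;
        m >>= 1;
    }
    return count;
}

/* addr : address passed to the low instruction (the high one gets addr + 64)
 * k1   : element selector
 * n    : element size in bytes after down-conversion */
static stream_split split_stream(uintptr_t addr, uint16_t k1, unsigned n)
{
    assert(n == 1 || n == 2 || n == 4 || n == 8);
    assert(addr % n == 0);                      /* element-wise alignment */

    int total = popcount16(k1);
    /* Elements that fit before the next 64-byte boundary after addr. */
    int room = (int)((64 - addr % 64) / n);

    stream_split s;
    s.low = total < room ? total : room;
    s.high = total - s.low;
    return s;
}
```

With addr % 64 equal to 52, a full mask and 4-byte elements, three elements go to the low line and thirteen to the high line; with a 64-byte-aligned address the whole stream fits in the low line.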


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Memory Down-conversion: Df32


S2S1S0   Function:             Usage             disp8*N
000      no conversion         zmm1              4
001      reserved              N/A               N/A
010      reserved              N/A               N/A
011      float32 to float16    zmm1 {float16}    2
100      float32 to uint8      zmm1 {uint8}      1
101      float32 to sint8      zmm1 {sint8}      1
110      float32 to uint16     zmm1 {uint16}     2
111      float32 to sint16     zmm1 {sint16}     2
Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorehi_ps (void*, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_mask_extpackstorehi_ps (void*, __mmask16, __m512,
                                    _MM_DOWNCONV_PS_ENUM, int);
void _mm512_packstorehi_ps (void*, __m512);
void _mm512_mask_packstorehi_ps (void*, __mmask16, __m512);


Exceptions


Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.
VPACKSTOREHQ - Pack And Store Unaligned High From Int64 Vector


Opcode                     Instruction                        Description
MVEX.512.66.0F38.W1 D4 /r  vpackstorehq mt {k1}, Di64(zmm1)   Pack mask-enabled elements of int64 vector
                                                              zmm1 to form an unaligned int64 stream,
                                                              down-convert it and logically map the stream
                                                              starting at mt - 64, and store that portion
                                                              of the stream that maps to the high
                                                              64-byte-aligned portion of the memory
                                                              destination, under write-mask.


Description 


Packs and down-converts the mask-enabled elements of int64 vector zmm1 into an int64
stream logically mapped starting at element-aligned address (mt - 64), and stores the
high-64-byte elements of that stream (those elements of the stream that map at or after
the first 64-byte-aligned address following (mt - 64), the high cache line in the current
implementation). The length of the stream depends on the number of enabled mask bits, as
elements disabled by the mask are not added to the stream.

The vpackstorelq instruction is used to store the part of the stream before the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.

In conjunction with vpackstorelq, this instruction is useful for packing data into a queue.
Also in conjunction with vpackstorelq, it allows unaligned vector stores (that is, vector
stores that are only element-wise, not vector-wise, aligned); just use a mask of 0xFF or
no write-mask for this purpose. The typical instruction sequence to perform an unaligned
vector store would be:

// assume the memory location is pointed to by register rax
vpackstorelq [rax] {k1}, v0
vpackstorehq [rax+64] {k1}, v0

This instruction does not have subset support.

This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.

Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand. The instruction will not produce any #GP or #SS fault due
to address canonicity, nor any #PF fault, if the address is aligned to a 64-byte boundary.
Additionally, A/D bits in the page table will not be updated.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_i64(SSS[2:0])
foundNext64BytesBoundary = false

pointer = mt - 64
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        if (foundNext64BytesBoundary == false) {
            if (((pointer + (storeOffset+1)*downSize) % 64) == 0) {
                foundNext64BytesBoundary = true
            }
        } else {
            i = 64*n
            tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
            if (downSize == 8) {
                MemStore(pointer + storeOffset*8) = tmp[63:0]
            }
        }
        storeOffset++
    }
}


Flags Affected 


None. 


Memory Down-conversion: Di64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    zmm1     8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A
Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorehi_epi64 (void*, __m512i, _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_mask_extpackstorehi_epi64 (void*, __mmask8, __m512i,
                                       _MM_DOWNCONV_EPI64_ENUM, int);
void _mm512_packstorehi_epi64 (void*, __m512i);
void _mm512_mask_packstorehi_epi64 (void*, __mmask8, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 


If a memory operand linear address is not aligned 
to element-wise data granularity dictated by the DownConv. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1.
#UD If processor model does not implement the specific instruction.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If the first operand is not a memory location.
VPACKSTORELD - Pack and Store Unaligned Low From Int32 Vector


Opcode                     Instruction                        Description
MVEX.512.66.0F38.W0 D0 /r  vpackstoreld mt {k1}, Di32(zmm1)   Pack mask-enabled elements of int32 vector
                                                              zmm1 to form an unaligned int32 stream,
                                                              down-convert it and logically map the stream
                                                              starting at mt, and store that portion of the
                                                              stream that maps to the low 64-byte-aligned
                                                              portion of the memory destination, under
                                                              write-mask.


Description 


Packs and down-converts the mask-enabled elements of int32 vector zmm1 into a
byte/word/doubleword stream logically mapped starting at element-aligned address mt,
and stores the low-64-byte elements of that stream (those elements of the stream that
map before the first 64-byte-aligned address following mt, the low cache line in the
current implementation). The length of the stream depends on the number of enabled mask
bits, as elements disabled by the mask are not added to the stream.

The vpackstorehd instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason,
the notation and encoding are the same as for a write-mask.

In conjunction with vpackstorehd, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehd, it allows unaligned vector stores (that is,
vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform
an unaligned vector store would be:

// assume the memory location is pointed to by register rax
vpackstoreld [rax] {k1}, v0
vpackstorehd [rax+64] {k1}, v0

This instruction does not have subset support.

This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.

Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_i32(SSS[2:0])

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*storeOffset) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*storeOffset) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + storeOffset) = tmp[7:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}


Flags Affected 


None. 


Memory Down-conversion: Di32


S2S1S0   Function:           Usage            disp8*N
000      no conversion       zmm1             4
001      reserved            N/A              N/A
010      reserved            N/A              N/A
011      reserved            N/A              N/A
100      uint32 to uint8     zmm1 {uint8}     1
101      sint32 to sint8     zmm1 {sint8}     1
110      uint32 to uint16    zmm1 {uint16}    2
111      sint32 to sint16    zmm1 {sint16}    2
Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorelo_epi32 (void*, __m512i, _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_mask_extpackstorelo_epi32 (void*, __mmask16, __m512i,
                                       _MM_DOWNCONV_EPI32_ENUM, int);
void _mm512_packstorelo_epi32 (void*, __m512i);
void _mm512_mask_packstorelo_epi32 (void*, __mmask16, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 


If a memory operand linear address is not aligned 
to element-wise data granularity dictated by the DownConv. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1.
#UD If processor model does not implement the specific instruction.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If the first operand is not a memory location.
VPACKSTORELPD - Pack and Store Unaligned Low From Float64 Vector


Opcode                     Instruction                         Description
MVEX.512.66.0F38.W1 D1 /r  vpackstorelpd mt {k1}, Df64(zmm1)   Pack mask-enabled elements of float64 vector
                                                               zmm1 to form an unaligned float64 stream,
                                                               down-convert it and logically map the stream
                                                               starting at mt, and store that portion of the
                                                               stream that maps to the low 64-byte-aligned
                                                               portion of the memory destination, under
                                                               write-mask.


Description 


Packs and down-converts the mask-enabled elements of float64 vector zmm1 into a
float64 stream logically mapped starting at element-aligned address mt, and stores the
low-64-byte elements of that stream (those elements of the stream that map before the
first 64-byte-aligned address following mt, the low cache line in the current
implementation). The length of the stream depends on the number of enabled mask bits, as
elements disabled by the mask are not added to the stream.

The vpackstorehpd instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.

In conjunction with vpackstorehpd, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehpd, it allows unaligned vector stores (that
is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFF or no write-mask for this purpose. The typical instruction sequence to perform an
unaligned vector store would be:

// assume the memory location is pointed to by register rax
vpackstorelpd [rax] {k1}, v0
vpackstorehpd [rax+64] {k1}, v0

This instruction does not have subset support.

This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.

Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_f64(SSS[2:0])

for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*storeOffset) = tmp[63:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}


SIMD Floating-Point Exceptions 


None. 


Memory Down-conversion: Df64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    zmm1     8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorelo_pd (void*, __m512d, _MM_DOWNCONV_PD_ENUM, int);
void _mm512_mask_extpackstorelo_pd (void*, __mmask8, __m512d,
                                    _MM_DOWNCONV_PD_ENUM, int);
void _mm512_packstorelo_pd (void*, __m512d);
void _mm512_mask_packstorelo_pd (void*, __mmask8, __m512d);
Exceptions


Real-Address Mode and Virtual-8086
#UD               Instruction not available in these modes

Protected and Compatibility Mode
#UD               Instruction not available in these modes

64 bit Mode
#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to element-wise data granularity dictated by the DownConv.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
#UD               If processor model does not implement the specific instruction.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.
                  If the first operand is not a memory location.
VPACKSTORELPS - Pack and Store Unaligned Low From Float32 Vector


Opcode                     Instruction                         Description
MVEX.512.66.0F38.W0 D1 /r  vpackstorelps mt {k1}, Df32(zmm1)   Pack mask-enabled elements of float32 vector
                                                               zmm1 to form an unaligned float32 stream,
                                                               down-convert it and logically map the stream
                                                               starting at mt, and store that portion of the
                                                               stream that maps to the low 64-byte-aligned
                                                               portion of the memory destination, under
                                                               write-mask.


Description 


Packs and down-converts the mask-enabled elements of float32 vector zmm1 into a
byte/word/doubleword stream logically mapped starting at element-aligned address mt,
and stores the low-64-byte elements of that stream (those elements of the stream that
map before the first 64-byte-aligned address following mt, the low cache line in the
current implementation). The length of the stream depends on the number of enabled mask
bits, as elements disabled by the mask are not added to the stream.

The vpackstorehps instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFFFF for this instruction. For that reason,
the notation and encoding are the same as for a write-mask.

In conjunction with vpackstorehps, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehps, it allows unaligned vector stores (that
is, vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFFFF or no write-mask for this purpose. The typical instruction sequence to perform
an unaligned vector store would be:

// assume the memory location is pointed to by register rax
vpackstorelps [rax] {k1}, v0
vpackstorehps [rax+64] {k1}, v0

This instruction does not have subset support.

This instruction has special disp8*N and alignment rules. N is considered to be the size
of a single vector element after down-conversion.

Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_f32(SSS[2:0])

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0])
        if (downSize == 4) {
            MemStore(mt + 4*storeOffset) = tmp[31:0]
        } else if (downSize == 2) {
            MemStore(mt + 2*storeOffset) = tmp[15:0]
        } else if (downSize == 1) {
            MemStore(mt + storeOffset) = tmp[7:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Memory Down-conversion: Df32


S2S1S0   Function:             Usage             disp8*N
000      no conversion         zmm1              4
001      reserved              N/A               N/A
010      reserved              N/A               N/A
011      float32 to float16    zmm1 {float16}    2
100      float32 to uint8      zmm1 {uint8}      1
101      float32 to sint8      zmm1 {sint8}      1
110      float32 to uint16     zmm1 {uint16}     2
111      float32 to sint16     zmm1 {sint16}     2


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_extpackstorelo_ps (void*, __m512, _MM_DOWNCONV_PS_ENUM, int);
void _mm512_mask_extpackstorelo_ps (void*, __mmask16, __m512,
                                    _MM_DOWNCONV_PS_ENUM, int);
void _mm512_packstorelo_ps (void*, __m512);
void _mm512_mask_packstorelo_ps (void*, __mmask16, __m512);




Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 
#GP(0) If the memory address is in a non-canonical form. 


If a memory operand linear address is not aligned 
to element-wise data granularity dictated by the DownConv. 
#PF(fault-code) For a page fault. 
#NM If CR0.TS[bit 3]=1.
#UD If processor model does not implement the specific instruction.
If preceded by any REX, F0, F2, F3, or 66 prefixes.
If the first operand is not a memory location.
VPACKSTORELQ - Pack and Store Unaligned Low From Int64 Vector


Opcode                     Instruction                        Description
MVEX.512.66.0F38.W1 D0 /r  vpackstorelq mt {k1}, Di64(zmm1)   Pack mask-enabled elements of int64 vector
                                                              zmm1 to form an unaligned int64 stream,
                                                              down-convert it and logically map the stream
                                                              starting at mt, and store that portion of the
                                                              stream that maps to the low 64-byte-aligned
                                                              portion of the memory destination, under
                                                              write-mask.


Description 


Packs and down-converts the mask-enabled elements of int64 vector zmm1 into an int64
stream logically mapped starting at element-aligned address mt, and stores the
low-64-byte elements of that stream (those elements of the stream that map before the
first 64-byte-aligned address following mt, the low cache line in the current
implementation). The length of the stream depends on the number of enabled mask bits, as
elements disabled by the mask are not added to the stream.

The vpackstorehq instruction is used to store the part of the stream at or after the first
64-byte-aligned address preceding mt.

The mask is not used as a write-mask for this instruction. Instead, the mask is used as an
element selector, choosing which elements are added to the stream. The one similarity
to a write-mask as used in the rest of this document is that the no-write-mask option
(encoding 0) is available to select a mask of 0xFF for this instruction. For that reason, the
notation and encoding are the same as for a write-mask.

In conjunction with vpackstorehq, this instruction is useful for packing data into a
queue. Also in conjunction with vpackstorehq, it allows unaligned vector stores (that is,
vector stores that are only element-wise, not vector-wise, aligned); just use a mask of
0xFF or no write-mask for this purpose. The typical instruction sequence to perform an
unaligned vector store would be:

// assume the memory location is pointed to by register rax
vpackstorelq [rax] {k1}, v0
vpackstorehq [rax+64] {k1}, v0
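
To see why the two-instruction sequence amounts to one unaligned 64-byte store, both halves can be modelled in C and applied to the same vector. This is a sketch under stated assumptions: the function names are hypothetical, only the no-conversion int64 case is covered, and faults are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ELEMS = 8, ESIZE = 8 };   /* 8 int64 elements, no down-conversion */

/* Low half: elements that land before the next 64-byte boundary after mt. */
static void model_packstorelq(unsigned char *mt, uint8_t k1,
                              const int64_t zmm1[ELEMS])
{
    size_t off = 0;
    assert((uintptr_t)mt % ESIZE == 0);
    for (int n = 0; n < ELEMS; n++) {
        if (!((k1 >> n) & 1)) continue;
        memcpy(mt + off, &zmm1[n], ESIZE);
        off += ESIZE;
        if (((uintptr_t)mt + off) % 64 == 0) break;
    }
}

/* High half: elements at or after that boundary; mt is the low address + 64. */
static void model_packstorehq(unsigned char *mt, uint8_t k1,
                              const int64_t zmm1[ELEMS])
{
    unsigned char *p = mt - 64;   /* logical start of the stream */
    size_t off = 0;
    int found = 0;
    for (int n = 0; n < ELEMS; n++) {
        if (!((k1 >> n) & 1)) continue;
        if (!found) {
            /* Still in the low line: skip, and note when it is full. */
            if (((uintptr_t)p + off + ESIZE) % 64 == 0) found = 1;
        } else {
            memcpy(p + off, &zmm1[n], ESIZE);
        }
        off += ESIZE;
    }
}

/* Unaligned (element-aligned) store of all eight elements at dst. */
static void model_unaligned_store_q(void *dst, const int64_t zmm1[ELEMS])
{
    model_packstorelq((unsigned char *)dst, 0xFF, zmm1);
    model_packstorehq((unsigned char *)dst + 64, 0xFF, zmm1);
}
```

In this model the low call fills memory up to the next 64-byte boundary and the high call writes the remainder, so the eight elements end up contiguous at dst whatever its offset within the line.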


This instruction does not have subset support. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note that the address reported by a page fault is the beginning of the 64-byte cache line
containing the memory operand.
Operation


storeOffset = 0
downSize = DownConvStoreSizeOf_i64(SSS[2:0])

for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
        if (downSize == 8) {
            MemStore(mt + 8*storeOffset) = tmp[63:0]
        }
        storeOffset++
        if (((mt + downSize*storeOffset) % 64) == 0) {
            break
        }
    }
}


Flags Affected 
None. 
Memory Down-conversion: D_i64 

S2S1S0   Function:       Usage   disp8*N 
000      no conversion   zmm1    8 
001      reserved        N/A     N/A 
010      reserved        N/A     N/A 
011      reserved        N/A     N/A 
100      reserved        N/A     N/A 
101      reserved        N/A     N/A 
110      reserved        N/A     N/A 
111      reserved        N/A     N/A 

Intel® C/C++ Compiler Intrinsic Equivalent 

void _mm512_extpackstorelo_epi64 (void*, __m512i, _MM_DOWNCONV_EPI64_ENUM, int); 
void _mm512_mask_extpackstorelo_epi64 (void*, __mmask8, __m512i, _MM_DOWNCONV_EPI64_ENUM, int); 
void _mm512_packstorelo_epi64 (void*, __m512i); 
void _mm512_mask_packstorelo_epi64 (void*, __mmask8, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to element-wise data granularity dictated by the DownConv mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If the first operand is not a memory location. 


VPADCD - Add Int32 Vectors with Carry 


Opcode        MVEX.NDS.512.66.0F38.W0 5C /r 
Instruction   vpadcd zmm1 {k1}, k2, S_i32(zmm3/m_t) 
Description   Add int32 vector S_i32(zmm3/m_t), vector mask register k2 and int32 
              vector zmm1 and store the result in zmm1, and the carry of the sum in 
              k2, under write-mask. 


Description 


Performs an element-by-element three-input addition between int32 vector zmm1, the 
int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vec- 
tor zmm3, and the corresponding bit of k2. The result is written into int32 vector zmm1. 


In addition, the carry from the sum for the n-th element is written into the n-th bit of 
vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        // integer operation 
        tmpCarry = Carry(zmm1[i+31:i] + k2[n] + tmpSrc3[i+31:i]) 
        zmm1[i+31:i] = zmm1[i+31:i] + k2[n] + tmpSrc3[i+31:i] 
        k2[n] = tmpCarry 
    } 
} 
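
As a rough illustration of the three-input addition, the following Python sketch models the operation on lists of unsigned 32-bit integers. The function name and the representation of registers as Python lists and integers are assumptions made for this example; swizzle and up-conversion of the third source are taken as already applied.

```python
MASK32 = 0xFFFFFFFF

def vpadcd(zmm1, k1, k2, src3):
    """Scalar model of vpadcd; returns (new_zmm1, new_k2).

    zmm1, src3 : lists of 16 unsigned 32-bit integers
    k1         : 16-bit write-mask
    k2         : 16-bit mask holding carry-in bits, updated with carry-out
    """
    out = list(zmm1)
    for n in range(16):
        if (k1 >> n) & 1:
            total = zmm1[n] + ((k2 >> n) & 1) + src3[n]
            out[n] = total & MASK32
            carry = total >> 32  # 0 or 1
            k2 = (k2 & ~(1 << n)) | (carry << n)
    return out, k2 & 0xFFFF

# Elements 0 and 1 are enabled; bit 15 of k2 is outside the write-mask.
a = [0xFFFFFFFF, 5] + [7] * 14
b = [0, 1] + [1] * 14
result, carries = vpadcd(a, 0x0003, 0x8003, b)
```

Here element 0 overflows and keeps its carry bit set, element 1 produces no carry and clears its bit, and bit 15 of k2 is retained because the write-mask disables that element.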


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 

Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_adc_epi32 (__m512i, __mmask16, __m512i, __mmask16*); 
__m512i _mm512_mask_adc_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPADDD - Add Int32 Vectors 


Opcode        MVEX.NDS.512.66.0F.W0 FE /r 
Instruction   vpaddd zmm1 {k1}, zmm2, S_i32(zmm3/m_t) 
Description   Add int32 vector zmm2 and int32 vector S_i32(zmm3/m_t) and store 
              the result in zmm1, under write-mask. 


Description 


Performs an element-by-element addition between int32 vector zmm2 and the int32 vec- 
tor result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        // integer operation 
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i] 
    } 
} 
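
The write-mask behaviour described above can be sketched in Python as follows. The function name and list-based register model are assumptions for illustration only; the third source is taken as already swizzled or up-converted.

```python
MASK32 = 0xFFFFFFFF

def vpaddd(zmm1, k1, zmm2, src3):
    """Scalar model of vpaddd: masked element-wise add, modulo 2**32.

    Elements whose k1 bit is clear keep the previous zmm1 value.
    """
    return [
        (zmm2[n] + src3[n]) & MASK32 if (k1 >> n) & 1 else zmm1[n]
        for n in range(16)
    ]

# Only the low eight elements are enabled; the high eight keep 0xAAAA.
previous = [0xAAAA] * 16
summed = vpaddd(previous, 0x00FF, list(range(16)), [0xFFFFFFFF] * 16)
```

Adding 0xFFFFFFFF behaves as subtracting one modulo 2**32, so enabled element n holds n - 1 (with element 0 wrapping to 0xFFFFFFFF).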


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 
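
To make the table concrete, here is a minimal Python sketch of the memory up-conversion and of the disp8*N scaling. It assumes little-endian memory held in a `bytes` object; the names `upconv_load_i32`, `DISP8_N`, and `scaled_disp` are hypothetical helpers introduced for this example.

```python
import struct

MASK32 = 0xFFFFFFFF

# N from the disp8*N column, keyed by the S2S1S0 encoding.
DISP8_N = {0b000: 64, 0b001: 4, 0b010: 16, 0b100: 16, 0b101: 16, 0b110: 32, 0b111: 32}

def scaled_disp(disp8, sss):
    """Byte displacement encoded by an 8-bit displacement under encoding sss."""
    return disp8 * DISP8_N[sss]

def upconv_load_i32(mem, addr, sss):
    """Model of the S_i32 memory up-conversion; returns 16 uint32 values."""
    if sss == 0b000:    # no conversion: 16 x 32-bit elements
        vals = struct.unpack_from("<16I", mem, addr)
    elif sss == 0b001:  # {1to16}: one element replicated 16 times
        vals = struct.unpack_from("<I", mem, addr) * 16
    elif sss == 0b010:  # {4to16}: four elements replicated 4 times
        vals = struct.unpack_from("<4I", mem, addr) * 4
    elif sss == 0b100:  # uint8 to uint32 (zero-extended)
        vals = struct.unpack_from("<16B", mem, addr)
    elif sss == 0b101:  # sint8 to sint32 (sign-extended)
        vals = struct.unpack_from("<16b", mem, addr)
    elif sss == 0b110:  # uint16 to uint32
        vals = struct.unpack_from("<16H", mem, addr)
    elif sss == 0b111:  # sint16 to sint32
        vals = struct.unpack_from("<16h", mem, addr)
    else:
        raise ValueError("SSS=011 is reserved")
    return [v & MASK32 for v in vals]

data = bytes([0xFF, 0x80, 0x01, 0x00] + [0] * 60)
as_sint8 = upconv_load_i32(data, 0, 0b101)
```

Note how N equals the number of bytes actually read from memory for each encoding, so a disp8 of 3 under {1to16} addresses the fourth 32-bit element.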


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 
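
The swizzle patterns in the table can be read as permutations applied independently to each group of four elements. The Python sketch below assumes that the pattern letters are written from the highest element to the lowest, so that `a` names element 0 of each group; the `SWIZZLES` table and `swizzle` function are hypothetical names for this illustration.

```python
# S2S1S0 encoding -> pattern, written highest element first (assumed convention).
SWIZZLES = {
    0b000: "dcba", 0b001: "cdab", 0b010: "badc", 0b011: "dacb",
    0b100: "aaaa", 0b101: "bbbb", 0b110: "cccc", 0b111: "dddd",
}

def swizzle(zmm, sss):
    """Apply a register swizzle to each 4-element group of zmm."""
    pattern = SWIZZLES[sss]
    out = []
    for lane in range(0, len(zmm), 4):
        # Reverse the pattern so that we emit element 0 of the group first.
        for letter in reversed(pattern):
            out.append(zmm[lane + "abcd".index(letter)])
    return out

v = list(range(16))
swapped_pairs = swizzle(v, 0b001)  # {cdab}
```

Under this reading, {cdab} exchanges elements 0 and 1 and elements 2 and 3 within every group, while {aaaa} replicates element 0 of each group.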


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_add_epi32 (__m512i, __m512i); 
__m512i _mm512_mask_add_epi32 (__m512i, __mmask16, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPADDSETCD - Add Int32 Vectors and Set Mask to Carry 


Opcode        MVEX.NDS.512.66.0F38.W0 5D /r 
Instruction   vpaddsetcd zmm1 {k1}, k2, S_i32(zmm3/m_t) 
Description   Add int32 vector zmm1 and int32 vector S_i32(zmm3/m_t) and store 
              the sum in zmm1 and the carry from the sum in k2, under write-mask. 


Description 


Performs an element-by-element addition between int32 vector zmm1 and the int32 vec- 
tor result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


In addition, the carry from the sum for the n-th element is written into the n-th bit of 
vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        // integer operation 
        k2[n] = Carry(zmm1[i+31:i] + tmpSrc3[i+31:i]) 
        zmm1[i+31:i] = zmm1[i+31:i] + tmpSrc3[i+31:i] 
    } 
} 
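
A small Python sketch of this operation follows. It treats elements as unsigned 32-bit integers and assumes the third source has already been swizzled or up-converted; the function name and list-based model are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF

def vpaddsetcd(zmm1, k1, k2, src3):
    """Scalar model of vpaddsetcd; returns (new_zmm1, new_k2).

    Unlike vpadcd, the old k2 bits are not added in; k2 only receives
    the carry-out of each enabled element.
    """
    out = list(zmm1)
    for n in range(16):
        if (k1 >> n) & 1:
            total = zmm1[n] + src3[n]
            out[n] = total & MASK32
            k2 = (k2 & ~(1 << n)) | ((total >> 32) << n)
    return out, k2 & 0xFFFF

# Element 0 overflows, element 1 does not; bits 8..15 of k2 are masked off.
low, carry_mask = vpaddsetcd([0xFFFFFFFF, 1] + [0] * 14, 0x0003, 0xFF00,
                             [1, 1] + [0] * 14)
```

The resulting carry mask can then feed a carry-consuming instruction such as vpadcd on the next-higher 32-bit words when building wider additions.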


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_addsetc_epi32 (__m512i, __m512i, __mmask16*); 
__m512i _mm512_mask_addsetc_epi32 (__m512i, __mmask16, __mmask16, __m512i, __mmask16*); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPADDSETSD - Add Int32 Vectors and Set Mask to Sign 


Opcode        MVEX.NDS.512.66.0F38.W0 CD /r 
Instruction   vpaddsetsd zmm1 {k1}, zmm2, S_i32(zmm3/m_t) 
Description   Add int32 vector zmm2 and int32 vector S_i32(zmm3/m_t) and store 
              the sum in zmm1 and the sign from the sum in k1, under write-mask. 


Description 


Performs an element-by-element addition between int32 vector zmm2 and the int32 vec- 
tor result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


In addition, the sign of the result for the n-th element is written into the n-th bit of vector 
mask k1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
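
Because k1 is both the write-mask and the destination for the sign bits, a short model may help. The Python sketch below is an illustration under assumptions: elements are unsigned 32-bit integers holding two's-complement values, the third source is already swizzled or up-converted, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def vpaddsetsd(zmm1, k1, zmm2, src3):
    """Scalar model of vpaddsetsd; returns (new_zmm1, new_k1).

    For each element enabled in k1, the sum is stored and the k1 bit is
    replaced by the sign bit (bit 31) of that sum. Disabled elements and
    their k1 bits (already 0) are left unchanged.
    """
    out = list(zmm1)
    for n in range(16):
        if (k1 >> n) & 1:
            out[n] = (zmm2[n] + src3[n]) & MASK32
            k1 = (k1 & ~(1 << n)) | ((out[n] >> 31) << n)
    return out, k1

# 5 + (-10) is negative, 0x7FFFFFFF + 1 sets bit 31, 3 + 4 is positive.
sums, signs = vpaddsetsd([9] * 16, 0b0111,
                         [5, 0x7FFFFFFF, 3] + [0] * 13,
                         [0xFFFFFFF6, 1, 4] + [0] * 13)
```

After the call only the bits of elements with a negative result remain set in the mask, which makes the instruction usable as a combined add-and-compare-with-zero.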


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        // signed integer operation 
        zmm1[i+31:i] = zmm2[i+31:i] + tmpSrc3[i+31:i] 
        k1[n] = zmm1[i+31] 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_addsets_epi32 (__m512i, __m512i, __mmask16*); 
__m512i _mm512_mask_addsets_epi32 (__m512i, __mmask16, __m512i, __m512i, __mmask16*); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If any memory operand linear address is not aligned to 4-byte 
                  data granularity. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If no write mask is provided or selected write-mask is k0. 


VPANDD - Bitwise AND Int32 Vectors 


Opcode        MVEX.NDS.512.66.0F.W0 DB /r 
Instruction   vpandd zmm1 {k1}, zmm2, S_i32(zmm3/m_t) 
Description   Perform a bitwise AND between int32 vector zmm2 and int32 vector 
              S_i32(zmm3/m_t) and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element bitwise AND between int32 vector zmm2 and the int32 
vector result of the swizzle/broadcast/conversion process on memory or int32 vector 
zmm3. The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = zmm2[i+31:i] & tmpSrc3[i+31:i] 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_and_epi32 (__m512i, __m512i); 
__m512i _mm512_mask_and_epi32 (__m512i, __mmask16, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPANDND - Bitwise AND NOT Int32 Vectors 


Opcode        MVEX.NDS.512.66.0F.W0 DF /r 
Instruction   vpandnd zmm1 {k1}, zmm2, S_i32(zmm3/m_t) 
Description   Perform a bitwise AND between NOT int32 vector zmm2 and int32 vector 
              S_i32(zmm3/m_t) and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element bitwise AND between NOT int32 vector zmm2 and the 
int32 vector result of the swizzle/broadcast/conversion process on memory or int32 vec- 
tor zmm3. The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = (~(zmm2[i+31:i])) & tmpSrc3[i+31:i] 
    } 
} 
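
Since it is easy to confuse which operand is complemented, the following Python sketch spells it out: the NOT applies to zmm2, not to the third source. The function name and list-based registers are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF

def vpandnd(zmm1, k1, zmm2, src3):
    """Scalar model of vpandnd: (~zmm2) & src3 per enabled 32-bit element."""
    return [
        (~zmm2[n] & MASK32) & src3[n] if (k1 >> n) & 1 else zmm1[n]
        for n in range(16)
    ]

# Clear, in src3, every bit that is set in zmm2 (element 0 only).
cleared = vpandnd([0] * 16, 0x0001, [0x0F0F0F0F] * 16, [0xFFFF0000] * 16)
```

In other words, zmm2 acts as a "bits to clear" pattern applied to the third source.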


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_andnot_epi32 (__m512i, __m512i); 
__m512i _mm512_mask_andnot_epi32 (__m512i, __mmask16, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPANDNQ - Bitwise AND NOT Int64 Vectors 


Opcode        MVEX.NDS.512.66.0F.W1 DF /r 
Instruction   vpandnq zmm1 {k1}, zmm2, S_i64(zmm3/m_t) 
Description   Perform a bitwise AND between NOT int64 vector zmm2 and int64 vector 
              S_i64(zmm3/m_t) and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element bitwise AND between NOT int64 vector zmm2 and the 
int64 vector result of the swizzle/broadcast/conversion process on memory or int64 vec- 
tor zmm3. The result is written into int64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i64(zmm3/m_t) 
} 

for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        zmm1[i+63:i] = (~(zmm2[i+63:i])) & tmpSrc3[i+63:i] 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: S_i64 

S2S1S0   Function:                   Usage                   disp8*N 
000      no conversion               [rax] {8to8} or [rax]   64 
001      broadcast 1 element (x8)    [rax] {1to8}            8 
010      broadcast 4 elements (x2)   [rax] {4to8}            32 
011      reserved                    N/A                     N/A 
100      reserved                    N/A                     N/A 
101      reserved                    N/A                     N/A 
110      reserved                    N/A                     N/A 
111      reserved                    N/A                     N/A 


Register Swizzle: S_i64 

MVEX.EH=0 
S2S1S0   Function: 4 x 64 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_andnot_epi64 (__m512i, __m512i); 
__m512i _mm512_mask_andnot_epi64 (__m512i, __mmask8, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 



VPANDQ - Bitwise AND Int64 Vectors 


Opcode        MVEX.NDS.512.66.0F.W1 DB /r 
Instruction   vpandq zmm1 {k1}, zmm2, S_i64(zmm3/m_t) 
Description   Perform a bitwise AND between int64 vector zmm2 and int64 vector 
              S_i64(zmm3/m_t) and store the result in zmm1, under write-mask. 


Description 


Performs an element-by-element bitwise AND between int64 vector zmm2 and the int64 
vector result of the swizzle/broadcast/conversion process on memory or int64 vector 
zmm3. The result is written into int64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i64(zmm3/m_t) 
} 

for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        zmm1[i+63:i] = zmm2[i+63:i] & tmpSrc3[i+63:i] 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: S_i64 

S2S1S0   Function:                   Usage                   disp8*N 
000      no conversion               [rax] {8to8} or [rax]   64 
001      broadcast 1 element (x8)    [rax] {1to8}            8 
010      broadcast 4 elements (x2)   [rax] {4to8}            32 
011      reserved                    N/A                     N/A 
100      reserved                    N/A                     N/A 
101      reserved                    N/A                     N/A 
110      reserved                    N/A                     N/A 
111      reserved                    N/A                     N/A 


Register Swizzle: S_i64 

MVEX.EH=0 
S2S1S0   Function: 4 x 64 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_and_epi64 (__m512i, __m512i); 
__m512i _mm512_mask_and_epi64 (__m512i, __mmask8, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPBLENDMD - Blend Int32 Vectors using the Instruction Mask 


Opcode        MVEX.NDS.512.66.0F38.W0 64 /r 
Instruction   vpblendmd zmm1 {k1}, zmm2, S_i32(zmm3/m_t) 
Description   Blend int32 vector zmm2 and int32 vector S_i32(zmm3/m_t) and store 
              the result in zmm1, under write-mask. 


Description 


Performs an element-by-element blending between int32 vector zmm2 and the int32 vec- 
tor result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3, 
using the instruction mask as selector. The result is written into int32 vector zmm1. 


The mask is not used as a write-mask for this instruction. Instead, the mask is used as an 
element selector: every element of the destination is conditionally selected from the first 
source or the second source using the value of the related mask bit (0 for the first source, 
1 for the second source). 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 

for (n = 0; n < 16; n++) { 
    i = 32*n 
    if(k1[n]==1 or *no write-mask*) { 
        zmm1[i+31:i] = tmpSrc3[i+31:i] 
    } else { 
        zmm1[i+31:i] = zmm2[i+31:i] 
    } 
} 
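
The selector semantics differ from a write-mask in that the previous contents of zmm1 never survive. A minimal Python sketch, with a hypothetical function name and list-based registers, and with the second source assumed to be already swizzled or up-converted:

```python
def vpblendmd(k1, zmm2, src3):
    """Scalar model of vpblendmd: per-element select between two sources.

    A set bit in k1 picks the second source (src3), a clear bit picks the
    first source (zmm2). Every destination element is overwritten.
    """
    return [src3[n] if (k1 >> n) & 1 else zmm2[n] for n in range(16)]

# Elements 4..7 come from the second source, all others from the first.
blended = vpblendmd(0x00F0, [0] * 16, [1] * 16)
```

Passing a mask of 0xFFFF in this model corresponds to the no-write-mask case, where the result is simply the second source.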


Flags Affected 


None. 


Memory Up-conversion: S_i32 

S2S1S0   Function:                   Usage                     disp8*N 
000      no conversion               [rax] {16to16} or [rax]   64 
001      broadcast 1 element (x16)   [rax] {1to16}             4 
010      broadcast 4 elements (x4)   [rax] {4to16}             16 
011      reserved                    N/A                       N/A 
100      uint8 to uint32             [rax] {uint8}             16 
101      sint8 to sint32             [rax] {sint8}             16 
110      uint16 to uint32            [rax] {uint16}            32 
111      sint16 to sint32            [rax] {sint16}            32 


Register Swizzle: S_i32 

MVEX.EH=0 
S2S1S0   Function: 4 x 32 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_mask_blend_epi32 (__mmask16, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPBLENDMQ - Blend Int64 Vectors using the Instruction Mask 


Opcode        MVEX.NDS.512.66.0F38.W1 64 /r 
Instruction   vpblendmq zmm1 {k1}, zmm2, S_i64(zmm3/m_t) 
Description   Blend int64 vector zmm2 and int64 vector S_i64(zmm3/m_t) and store 
              the result in zmm1, under write-mask. 


Description 


Performs an element-by-element blending between int64 vector zmm2 and the int64 vec- 
tor result of the swizzle/broadcast/conversion process on memory or int64 vector zmm3, 
using the instruction mask as selector. The result is written into int64 vector zmm1. 


The mask is not used as a write-mask for this instruction. Instead, the mask is used as an 
element selector: every element of the destination is conditionally selected from the first 
source or the second source using the value of the related mask bit (0 for the first source, 
1 for the second source). 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    tmpSrc3[511:0] = SwizzUpConvLoad_i64(zmm3/m_t) 
} 

for (n = 0; n < 8; n++) { 
    i = 64*n 
    if(k1[n]==1 or *no write-mask*) { 
        zmm1[i+63:i] = tmpSrc3[i+63:i] 
    } else { 
        zmm1[i+63:i] = zmm2[i+63:i] 
    } 
} 


Flags Affected 


None. 


Memory Up-conversion: S_i64 

S2S1S0   Function:                   Usage                   disp8*N 
000      no conversion               [rax] {8to8} or [rax]   64 
001      broadcast 1 element (x8)    [rax] {1to8}            8 
010      broadcast 4 elements (x2)   [rax] {4to8}            32 
011      reserved                    N/A                     N/A 
100      reserved                    N/A                     N/A 
101      reserved                    N/A                     N/A 
110      reserved                    N/A                     N/A 
111      reserved                    N/A                     N/A 


Register Swizzle: S_i64 

MVEX.EH=0 
S2S1S0   Function: 4 x 64 bits   Usage 
000      no swizzle              zmm0 or zmm0 {dcba} 
001      swap (inner) pairs      zmm0 {cdab} 
010      swap with two-away      zmm0 {badc} 
011      cross-product swizzle   zmm0 {dacb} 
100      broadcast a element     zmm0 {aaaa} 
101      broadcast b element     zmm0 {bbbb} 
110      broadcast c element     zmm0 {cccc} 
111      broadcast d element     zmm0 {dddd} 


Intel® C/C++ Compiler Intrinsic Equivalent 

__m512i _mm512_mask_blend_epi64 (__mmask8, __m512i, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 
#UD               Instruction not available in these modes 

Protected and Compatibility Mode 
#UD               Instruction not available in these modes 

64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 


VPBROADCASTD - Broadcast Int32 Vector 


Opcode        MVEX.512.66.0F38.W0 58 /r 
Instruction   vpbroadcastd zmm1 {k1}, U_i32(m_t) 
Description   Broadcast int32 vector U_i32(m_t) into vector zmm1, under write-mask. 

Description 


The 1, 2, or 4 bytes (depending on the conversion and broadcast in effect) at memory 
address m_t are broadcast and/or converted to an int32 vector. The result is written into 
int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


// {1to16} 
tmpSrc2[31:0] = UpConvLoad_i32(m_t) 
for (n = 0; n < 16; n++) { 
    if (k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = tmpSrc2[31:0] 
    } 
} 
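
The broadcast with optional up-conversion can be sketched in Python as below. The sketch assumes little-endian memory held in a `bytes` object; the function name and the `FORMATS` lookup are hypothetical, and the encodings follow the U_i32 up-conversion table for this instruction.

```python
import struct

MASK32 = 0xFFFFFFFF

# S2S1S0 encoding -> struct format of the single source element.
FORMATS = {0b000: "<I", 0b100: "<B", 0b101: "<b", 0b110: "<H", 0b111: "<h"}

def vpbroadcastd(zmm1, k1, mem, addr, sss=0b000):
    """Scalar model of vpbroadcastd with U_i32 up-conversion.

    Loads one element (1, 2, or 4 bytes), converts it to 32 bits, and
    writes it to every element enabled in k1.
    """
    if sss not in FORMATS:
        raise ValueError("reserved S2S1S0 encoding")
    (value,) = struct.unpack_from(FORMATS[sss], mem, addr)
    value &= MASK32  # two's-complement representation of signed inputs
    return [value if (k1 >> n) & 1 else zmm1[n] for n in range(16)]

data = bytes([0x80, 0xFF, 0x00, 0x00])
splat = vpbroadcastd([7] * 16, 0x0003, data, 0, 0b101)  # {sint8}
```

With the {sint8} conversion only one byte is read, which matches the disp8*N value of 1 for that encoding.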


Flags Affected 


None. 


Memory Up-conversion: U3. 


525159 || Function: Usage disp8*N 
000 no conversion [rax] 4 

001 reserved N/A N/A 
010 reserved N/A N/A 
011 reserved N/A N/A 
100 uint8 to uint32 [rax] {uint8} 1 

101 sint8 to sint32 [rax] {sint8} 1 

110 uint16 to uint32 [rax] {uint16} 2 

111 sint16 to sint32 [rax] {sint16} 2 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_extload_epi32(void const*, _MM_UPCONV_EPI32_ENUM,
_MM_BROADCAST32_ENUM, int);
__m512i _mm512_mask_extload_epi32(__m512i, __mmask16, void const*,
_MM_UPCONV_EPI32_ENUM, _MM_BROADCAST32_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPBROADCASTQ - Broadcast Int64 Vector


Opcode                      Instruction                        Description
MVEX.512.66.0F38.W1 59 /r   vpbroadcastq zmm1 {k1}, Ui64(mt)   Broadcast int64 vector Ui64(mt) into vector
                                                               zmm1, under write-mask.

Description 


The 8 bytes at memory address mt are broadcast to an int64 vector. The result is written
into int64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


// {1to8}
tmpSrc2[63:0] = UpConvLoadi64(mt)
for (n = 0; n < 8; n++) {
    if (k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = tmpSrc2[63:0]
    }
}


Flags Affected 


None. 


Memory Up-conversion: Ui64


S2S1S0 | Function:      | Usage  | disp8*N
000    | no conversion  | [rax]  | 8
001    | reserved       | N/A    | N/A
010    | reserved       | N/A    | N/A
011    | reserved       | N/A    | N/A
100    | reserved       | N/A    | N/A
101    | reserved       | N/A    | N/A
110    | reserved       | N/A    | N/A
111    | reserved       | N/A    | N/A


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_extload_epi64(void const*, _MM_UPCONV_EPI64_ENUM,
_MM_BROADCAST64_ENUM, int);
__m512i _mm512_mask_extload_epi64(__m512i, __mmask16, void const*,
_MM_UPCONV_EPI64_ENUM, _MM_BROADCAST64_ENUM, int);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPCMPD - Compare Int32 Vectors and Set Vector Mask


Opcode                             Instruction                                 Description
MVEX.NDS.512.66.0F3A.W0 1F /r ib   vpcmpd k2 {k1}, zmm1, Si32(zmm2/mt), imm8   Compare between int32 vector zmm1
                                                                               and int32 vector Si32(zmm2/mt)
                                                                               and store the result in k2, under
                                                                               write-mask.


Description 


Performs an element-by-element comparison between int32 vector zmm1 and the int32
vector result of the swizzle/broadcast/conversion from memory or int32 vector zmm2.
The result is written into vector mask k2.


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough so that it makes sense to use the usual
write-mask notation. This mode of operation is desirable because the result will be used
directly as a write-mask, rather than the normal case where the result is used with a
separate write-mask that keeps the masked elements inactive.


Immediate Format


Comparison Type                | I2 | I1 | I0
eq  | Equal                    | 0  | 0  | 0
lt  | Less than                | 0  | 0  | 1
le  | Less than or Equal       | 0  | 1  | 0
neq | Not Equal                | 1  | 0  | 0
nlt | Not Less than            | 1  | 0  | 1
nle | Not Less than or Equal   | 1  | 1  | 0


Operation


switch (IMM8[2:0]) {
    case 0: OP ← EQ; break;
    case 1: OP ← LT; break;
    case 2: OP ← LE; break;
    case 4: OP ← NEQ; break;
    case 5: OP ← NLT; break;
    case 6: OP ← NLE; break;
    default: Reserved; break;
}

if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
    }
}
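
The immediate decoding and the zeroing effect of the write-mask can be illustrated with a small scalar model. This is only a sketch: sim_vpcmpd is a hypothetical helper, the second source is taken as an already swizzled or converted array, and the predicate numbering follows the Immediate Format table.

```c
#include <assert.h>
#include <stdint.h>

/* Models vpcmpd k2 {k1}, zmm1, zmm2, imm8 with no swizzle or conversion. */
static uint16_t sim_vpcmpd(uint16_t k1, const int32_t a[16],
                           const int32_t b[16], unsigned imm8)
{
    uint16_t k2 = 0;              /* every destination bit starts at 0 */
    unsigned pred = imm8 & 7;     /* only IMM8[2:0] selects the predicate */
    assert(pred != 3 && pred != 7);   /* reserved encodings */
    for (int n = 0; n < 16; n++) {
        int r;
        if (!((k1 >> n) & 1))
            continue;             /* masked-off element: bit stays 0 */
        switch (pred) {
        case 0:  r = a[n] == b[n]; break;   /* eq  */
        case 1:  r = a[n] <  b[n]; break;   /* lt  */
        case 2:  r = a[n] <= b[n]; break;   /* le  */
        case 4:  r = a[n] != b[n]; break;   /* neq */
        case 5:  r = a[n] >= b[n]; break;   /* nlt */
        default: r = a[n] >  b[n]; break;   /* 6: nle */
        }
        k2 |= (uint16_t)(r << n);
    }
    return k2;
}
```

Because masked-off bits are cleared rather than preserved, the returned mask can be fed straight into a following instruction as its write-mask.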


Instruction Pseudo-ops 


Compilers and assemblers may implement the following pseudo-ops in addition to the 
standard instruction op: 


Pseudo-Op                               | Implementation
vpcmpeqd k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {eq}
vpcmpltd k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {lt}
vpcmpled k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {le}
vpcmpneqd k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {neq}
vpcmpnltd k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {nlt}
vpcmpnled k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpd k2 {k1}, zmm1, Si(zmm2/mt), {nle}


Flags Affected


None.


Memory Up-conversion: Si32


S2S1S0 | Function:                  | Usage                    | disp8*N
000    | no conversion              | [rax] {16to16} or [rax]  | 64
001    | broadcast 1 element (x16)  | [rax] {1to16}            | 4
010    | broadcast 4 elements (x4)  | [rax] {4to16}            | 16
011    | reserved                   | N/A                      | N/A
100    | uint8 to uint32            | [rax] {uint8}            | 16
101    | sint8 to sint32            | [rax] {sint8}            | 16
110    | uint16 to uint32           | [rax] {uint16}           | 32
111    | sint16 to sint32           | [rax] {sint16}           | 32
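
As a worked example of the disp8*N column, the sketch below assumes (as the column name suggests) that the signed 8-bit displacement field of the instruction is multiplied by N to obtain the byte displacement. The table si32_disp8_n and the function si32_disp8_bytes are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* N for the Si32 memory forms, indexed by S2S1S0; 0 marks a reserved encoding. */
static const int si32_disp8_n[8] = { 64, 4, 16, 0, 16, 16, 32, 32 };

/* Scales the signed 8-bit displacement field by N for the selected form. */
static int32_t si32_disp8_bytes(unsigned s2s1s0, int8_t disp8)
{
    int n = si32_disp8_n[s2s1s0 & 7];
    assert(n != 0);               /* 011 is reserved */
    return (int32_t)disp8 * n;
}
```

For instance, a displacement field of 2 addresses byte offset 128 with no conversion, but byte offset 8 with a {1to16} broadcast.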


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits  | Usage
000    | no swizzle             | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs     | zmm0 {cdab}
010    | swap with two-away     | zmm0 {badc}
011    | cross-product swizzle  | zmm0 {dacb}
100    | broadcast a element    | zmm0 {aaaa}
101    | broadcast b element    | zmm0 {bbbb}
110    | broadcast c element    | zmm0 {cccc}
111    | broadcast d element    | zmm0 {dddd}
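
The swizzle patterns can be read as a per-lane index table. The sketch below assumes the usual convention for this notation: each group of four 32-bit elements is named d, c, b, a from most to least significant, and the pattern lists the source of each position in that same order. The names swz_src and sim_swizzle32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Source element (0 = a ... 3 = d) for destination positions 0..3,
   indexed by S2S1S0. */
static const uint8_t swz_src[8][4] = {
    {0, 1, 2, 3},   /* 000 {dcba} no swizzle         */
    {1, 0, 3, 2},   /* 001 {cdab} swap inner pairs   */
    {2, 3, 0, 1},   /* 010 {badc} swap with two-away */
    {1, 2, 0, 3},   /* 011 {dacb} cross-product      */
    {0, 0, 0, 0},   /* 100 {aaaa} */
    {1, 1, 1, 1},   /* 101 {bbbb} */
    {2, 2, 2, 2},   /* 110 {cccc} */
    {3, 3, 3, 3},   /* 111 {dddd} */
};

/* Applies the selected swizzle independently to each 4-element lane. */
static void sim_swizzle32(uint32_t dst[16], const uint32_t src[16],
                          unsigned s2s1s0)
{
    assert(dst != src);           /* this sketch is not safe in place */
    for (int n = 0; n < 16; n++) {
        int lane = n & ~3;        /* first element of the 4-element lane */
        dst[n] = src[lane + swz_src[s2s1s0 & 7][n & 3]];
    }
}
```

With src[n] = n, the {cdab} pattern yields 1, 0, 3, 2 in the first lane and 5, 4, 7, 6 in the second.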


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmp_epi32_mask(__m512i, __m512i, const _MM_CMPINT_ENUM);
__mmask16 _mm512_mask_cmp_epi32_mask(__mmask16, __m512i, __m512i, const
_MM_CMPINT_ENUM);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPCMPEQD - Compare Equal Int32 Vectors and Set Vector Mask


Opcode                        Instruction                             Description
MVEX.NDS.512.66.0F.W0 76 /r   vpcmpeqd k2 {k1}, zmm1, Si32(zmm2/mt)   Compare Equal between int32 vector
                                                                      zmm1 and int32 vector Si32(zmm2/mt),
                                                                      and set vector mask k2 to reflect
                                                                      the zero/non-zero status of each
                                                                      element of the result, under
                                                                      write-mask.


Description 


Performs an element-by-element compare for equality between int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough so that it makes sense to use the usual
write-mask notation. This mode of operation is desirable because the result will be used
directly as a write-mask, rather than the normal case where the result is used with a
separate write-mask that keeps the masked elements inactive.


Operation 


if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        k2[n] = (zmm1[i+31:i] == tmpSrc2[i+31:i]) ? 1 : 0
    }
}


Flags Affected


None.


Memory Up-conversion: Si32


S2S1S0 | Function:                  | Usage                    | disp8*N
000    | no conversion              | [rax] {16to16} or [rax]  | 64
001    | broadcast 1 element (x16)  | [rax] {1to16}            | 4
010    | broadcast 4 elements (x4)  | [rax] {4to16}            | 16
011    | reserved                   | N/A                      | N/A
100    | uint8 to uint32            | [rax] {uint8}            | 16
101    | sint8 to sint32            | [rax] {sint8}            | 16
110    | uint16 to uint32           | [rax] {uint16}           | 32
111    | sint16 to sint32           | [rax] {sint16}           | 32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits  | Usage
000    | no swizzle             | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs     | zmm0 {cdab}
010    | swap with two-away     | zmm0 {badc}
011    | cross-product swizzle  | zmm0 {dacb}
100    | broadcast a element    | zmm0 {aaaa}
101    | broadcast b element    | zmm0 {bbbb}
110    | broadcast c element    | zmm0 {cccc}
111    | broadcast d element    | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmpeq_epi32_mask (__m512i, __m512i);
__mmask16 _mm512_mask_cmpeq_epi32_mask (__mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPCMPGTD - Compare Greater Than Int32 Vectors and Set Vector Mask


Opcode                        Instruction                             Description
MVEX.NDS.512.66.0F.W0 66 /r   vpcmpgtd k2 {k1}, zmm1, Si32(zmm2/mt)   Compare Greater between int32 vector
                                                                      zmm1 and int32 vector Si32(zmm2/mt),
                                                                      and set vector mask k2 to reflect
                                                                      the zero/non-zero status of each
                                                                      element of the result, under
                                                                      write-mask.


Description 


Performs an element-by-element compare for the greater value of int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough so that it makes sense to use the usual
write-mask notation. This mode of operation is desirable because the result will be used
directly as a write-mask, rather than the normal case where the result is used with a
separate write-mask that keeps the masked elements inactive.


Operation 


if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        k2[n] = (zmm1[i+31:i] > tmpSrc2[i+31:i]) ? 1 : 0
    }
}


Flags Affected


None.


Memory Up-conversion: Si32


S2S1S0 | Function:                  | Usage                    | disp8*N
000    | no conversion              | [rax] {16to16} or [rax]  | 64
001    | broadcast 1 element (x16)  | [rax] {1to16}            | 4
010    | broadcast 4 elements (x4)  | [rax] {4to16}            | 16
011    | reserved                   | N/A                      | N/A
100    | uint8 to uint32            | [rax] {uint8}            | 16
101    | sint8 to sint32            | [rax] {sint8}            | 16
110    | uint16 to uint32           | [rax] {uint16}           | 32
111    | sint16 to sint32           | [rax] {sint16}           | 32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits  | Usage
000    | no swizzle             | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs     | zmm0 {cdab}
010    | swap with two-away     | zmm0 {badc}
011    | cross-product swizzle  | zmm0 {dacb}
100    | broadcast a element    | zmm0 {aaaa}
101    | broadcast b element    | zmm0 {bbbb}
110    | broadcast c element    | zmm0 {cccc}
111    | broadcast d element    | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmpgt_epi32_mask (__m512i, __m512i);
__mmask16 _mm512_mask_cmpgt_epi32_mask (__mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPCMPLTD - Compare Less Than Int32 Vectors and Set Vector Mask


Opcode                          Instruction                             Description
MVEX.NDS.512.66.0F38.W0 74 /r   vpcmpltd k2 {k1}, zmm1, Si32(zmm2/mt)   Compare Less between int32 vector
                                                                        zmm1 and int32 vector Si32(zmm2/mt),
                                                                        and set vector mask k2 to reflect
                                                                        the zero/non-zero status of each
                                                                        element of the result, under
                                                                        write-mask.


Description 


Performs an element-by-element compare for the lesser value of int32 vector zmm1 and
the int32 vector result of the swizzle/broadcast/conversion from memory or int32 vector
zmm2. The result is written into vector mask k2.


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough so that it makes sense to use the usual
write-mask notation. This mode of operation is desirable because the result will be used
directly as a write-mask, rather than the normal case where the result is used with a
separate write-mask that keeps the masked elements inactive.


Operation 


if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // signed integer operation
        k2[n] = (zmm1[i+31:i] < tmpSrc2[i+31:i]) ? 1 : 0
    }
}


Flags Affected


None.


Memory Up-conversion: Si32


S2S1S0 | Function:                  | Usage                    | disp8*N
000    | no conversion              | [rax] {16to16} or [rax]  | 64
001    | broadcast 1 element (x16)  | [rax] {1to16}            | 4
010    | broadcast 4 elements (x4)  | [rax] {4to16}            | 16
011    | reserved                   | N/A                      | N/A
100    | uint8 to uint32            | [rax] {uint8}            | 16
101    | sint8 to sint32            | [rax] {sint8}            | 16
110    | uint16 to uint32           | [rax] {uint16}           | 32
111    | sint16 to sint32           | [rax] {sint16}           | 32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits  | Usage
000    | no swizzle             | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs     | zmm0 {cdab}
010    | swap with two-away     | zmm0 {badc}
011    | cross-product swizzle  | zmm0 {dacb}
100    | broadcast a element    | zmm0 {aaaa}
101    | broadcast b element    | zmm0 {bbbb}
110    | broadcast c element    | zmm0 {cccc}
111    | broadcast d element    | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmplt_epi32_mask (__m512i, __m512i);
__mmask16 _mm512_mask_cmplt_epi32_mask (__mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPCMPUD - Compare Uint32 Vectors and Set Vector Mask


Opcode                             Instruction                                  Description
MVEX.NDS.512.66.0F3A.W0 1E /r ib   vpcmpud k2 {k1}, zmm1, Si32(zmm2/mt), imm8   Compare between uint32 vector zmm1
                                                                                and uint32 vector Si32(zmm2/mt)
                                                                                and store the result in k2, under
                                                                                write-mask.


Description 


Performs an element-by-element comparison between uint32 vector zmm1 and the
uint32 vector result of the swizzle/broadcast/conversion from memory or uint32 vector
zmm2. The result is written into vector mask k2.


The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0.
Nonetheless, the operation is similar enough so that it makes sense to use the usual
write-mask notation. This mode of operation is desirable because the result will be used
directly as a write-mask, rather than the normal case where the result is used with a
separate write-mask that keeps the masked elements inactive.


Immediate Format


Comparison Type                | I2 | I1 | I0
eq  | Equal                    | 0  | 0  | 0
lt  | Less than                | 0  | 0  | 1
le  | Less than or Equal       | 0  | 1  | 0
neq | Not Equal                | 1  | 0  | 0
nlt | Not Less than            | 1  | 0  | 1
nle | Not Less than or Equal   | 1  | 1  | 0


Operation


switch (IMM8[2:0]) {
    case 0: OP ← EQ; break;
    case 1: OP ← LT; break;
    case 2: OP ← LE; break;
    case 4: OP ← NEQ; break;
    case 5: OP ← NLT; break;
    case 6: OP ← NLE; break;
    default: Reserved; break;
}

if (source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoadi32(zmm2/mt)
}
for (n = 0; n < 16; n++) {
    k2[n] = 0
    if (k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        k2[n] = (zmm1[i+31:i] OP tmpSrc2[i+31:i]) ? 1 : 0
    }
}
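
The only difference from the signed compare is how the 32-bit patterns are interpreted. The following sketch contrasts the two readings for the "less than" predicate; sim_cmplt is a hypothetical helper, and the signed branch relies on the common two's complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/* Less-than compare under write-mask; is_unsigned selects the
   vpcmpud-style reading, otherwise the vpcmpd-style signed reading. */
static uint16_t sim_cmplt(uint16_t k1, const uint32_t a[16],
                          const uint32_t b[16], int is_unsigned)
{
    uint16_t k2 = 0;
    for (int n = 0; n < 16; n++) {
        int r;
        if (!((k1 >> n) & 1))
            continue;             /* masked-off element: bit stays 0 */
        r = is_unsigned ? (a[n] < b[n])
                        : ((int32_t)a[n] < (int32_t)b[n]);
        k2 |= (uint16_t)(r << n);
    }
    return k2;
}
```

With every element of a equal to 0xFFFFFFFF and every element of b equal to 1, the signed reading sets all enabled bits (because -1 < 1), while the unsigned reading sets none.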


Instruction Pseudo-ops 


Compilers and assemblers may implement the following pseudo-ops in addition to the
standard instruction op:


Pseudo-Op                                | Implementation
vpcmpequd k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {eq}
vpcmpltud k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {lt}
vpcmpleud k2 {k1}, zmm1, Si(zmm2/mt)     | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {le}
vpcmpnequd k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {neq}
vpcmpnltud k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {nlt}
vpcmpnleud k2 {k1}, zmm1, Si(zmm2/mt)    | vpcmpud k2 {k1}, zmm1, Si(zmm2/mt), {nle}


Flags Affected 


None. 


Memory Up-conversion: Si32


S2S1S0 | Function:                  | Usage                    | disp8*N
000    | no conversion              | [rax] {16to16} or [rax]  | 64
001    | broadcast 1 element (x16)  | [rax] {1to16}            | 4
010    | broadcast 4 elements (x4)  | [rax] {4to16}            | 16
011    | reserved                   | N/A                      | N/A
100    | uint8 to uint32            | [rax] {uint8}            | 16
101    | sint8 to sint32            | [rax] {sint8}            | 16
110    | uint16 to uint32           | [rax] {uint16}           | 32
111    | sint16 to sint32           | [rax] {sint16}           | 32


Register Swizzle: Si32


MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits  | Usage
000    | no swizzle             | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs     | zmm0 {cdab}
010    | swap with two-away     | zmm0 {badc}
011    | cross-product swizzle  | zmm0 {dacb}
100    | broadcast a element    | zmm0 {aaaa}
101    | broadcast b element    | zmm0 {bbbb}
110    | broadcast c element    | zmm0 {cccc}
111    | broadcast d element    | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__mmask16 _mm512_cmp_epi32_mask(__m512i, __m512i, const _MM_CMPINT_ENUM);
__mmask16 _mm512_mask_cmp_epi32_mask(__mmask16, __m512i, __m512i, const
_MM_CMPINT_ENUM);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPERMD - Permutes Int32 Vectors


Opcode                          Instruction                       Description
MVEX.NDS.512.66.0F38.W0 36 /r   vpermd zmm1 {k1}, zmm2, zmm3/mt   Element permute vector zmm3/mt using vector
                                                                  zmm2 as indices and store the final result in
                                                                  zmm1, under write-mask.


Description 


Performs a permutation of the elements of the int32 vector read from memory or from
vector zmm3, using the elements of int32 vector zmm2 as source indices. The result of the
permutation is written into int32 vector zmm1. Note that index values may be repeated so
that the same vector source element can be replicated across many vector destination
elements.


No swizzle, broadcast, or conversion is performed by this instruction. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


src[511:0] = zmm3/mt

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        j = zmm2[i+3:i]*32
        zmm1[i+31:i] = src[j+31:j]
    }
}
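
A scalar model of the permutation may help to see how indices and the write-mask interact. In this sketch sim_vpermd is a hypothetical helper operating on plain arrays; only the low 4 bits of each index are used, matching the zmm2[i+3:i] selection in the operation above.

```c
#include <stdint.h>

/* Models vpermd zmm1 {k1}, zmm2, zmm3: idx plays the role of zmm2 and
   src the role of zmm3/mt. */
static void sim_vpermd(uint32_t zmm1[16], uint16_t k1,
                       const uint32_t idx[16], const uint32_t src[16])
{
    uint32_t tmp[16];
    /* Gather into a temporary first so zmm1 may alias src or idx. */
    for (int n = 0; n < 16; n++)
        tmp[n] = src[idx[n] & 0xF];   /* bits above bit 3 are ignored */
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            zmm1[n] = tmp[n];         /* masked-off elements are preserved */
    }
}
```

An index vector of 15, 14, ..., 0 reverses the source, while an index vector of all zeros replicates element 0 across every enabled destination element.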


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_permutevar_epi32 (__m512i, __m512i);
__m512i _mm512_mask_permutevar_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.


VPERMF32X4 - Shuffle Vector Dqwords


Opcode                         Instruction                           Description
MVEX.512.66.0F3A.W0 07 /r ib   vpermf32x4 zmm1 {k1}, zmm2/mt, imm8   4xFloat32 shuffle element vector zmm2/mt
                                                                     and store the result in zmm1, using imm8,
                                                                     under write-mask.


Description 


Shuffles 128-bit blocks of the vector read from memory or vector zmm2/mem using index 
bits in immediate. The result of the shuffle is written into vector zmm1. 


No swizzle, broadcast, or conversion is performed by this instruction. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


imm8 | 128-bit level permutation vector {3210} | I7 I6 I5 I4 I3 I2 I1 I0


Operation 


src[511:0] = zmm2/mt

// Inter-lane shuffle
for (n = 0; n < 16/4; n++) {
    i = 128*n
    j = 128*((perm128 >> 2*n) & 0x3)
    tmp[i+127:i] = src[j+127:j]
}

// Writemasking
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = tmp[i+31:i]
    }
}
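
The two phases (block shuffle, then per-element write-masking) can be sketched in C as follows. sim_vpermf32x4 is a hypothetical helper; each 128-bit block is modelled as four consecutive uint32_t elements, and imm8 is assumed to play the role of perm128 in the operation above.

```c
#include <stdint.h>

/* Models vpermf32x4 zmm1 {k1}, zmm2/mt, imm8. */
static void sim_vpermf32x4(uint32_t zmm1[16], uint16_t k1,
                           const uint32_t src[16], uint8_t imm8)
{
    uint32_t tmp[16];
    /* Inter-lane shuffle: two immediate bits select the source block. */
    for (int n = 0; n < 4; n++) {
        int sel = (imm8 >> (2 * n)) & 0x3;
        for (int e = 0; e < 4; e++)
            tmp[4 * n + e] = src[4 * sel + e];
    }
    /* Write-masking is applied per 32-bit element, not per block. */
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1)
            zmm1[n] = tmp[n];
    }
}
```

An immediate of 0x1B (binary 00 01 10 11) reverses the order of the four blocks, and an immediate of 0x00 replicates block 0 into every enabled position.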


Flags Affected


None.


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_permute4f128_epi32 (__m512i, _MM_PERM_ENUM);
__m512i _mm512_mask_permute4f128_epi32 (__m512i, __mmask16, __m512i,
_MM_PERM_ENUM);
__m512 _mm512_permute4f128_ps (__m512, _MM_PERM_ENUM);
__m512 _mm512_mask_permute4f128_ps (__m512, __mmask16, __m512,
_MM_PERM_ENUM);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.


VPGATHERDD - Gather Int32 Vector With Signed Dword Indices


Opcode                            Instruction                       Description
MVEX.512.66.0F38.W0 90 /r /vsib   vpgatherdd zmm1 {k1}, Ui32(mvt)   Gather int32 vector Ui32(mvt) into int32
                                                                    vector zmm1 using doubleword indices and k1
                                                                    as completion mask.


Description 


A set of 16 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to an int32 vector. The result is
written into int32 vector zmm1.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always enforce the execution of a gather/scatter instruction to be 
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the 
gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are 
zero). 
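
The re-execution requirement can be illustrated with a scalar simulation. This is a sketch under stated assumptions, not hardware behaviour: sim_vpgatherdd_step and sim_gather_all are hypothetical helpers, SELECT_SUBSET is modelled as picking only the least significant enabled element (the minimum that is guaranteed), no up-conversion is applied, and SCALE is taken as 4 so that indices address int32_t array elements.

```c
#include <stdint.h>

/* One simulated invocation: completes a single pending element and
   clears its bit in the completion mask. */
static void sim_vpgatherdd_step(int32_t zmm1[16], uint16_t *k1,
                                const int32_t *base, const int32_t vindex[16])
{
    for (int n = 0; n < 16; n++) {
        if ((*k1 >> n) & 1) {
            zmm1[n] = base[vindex[n]];      /* BASE_ADDR + VINDEX[n] * 4 */
            *k1 &= (uint16_t)~(1u << n);    /* mark element n as done */
            return;
        }
    }
}

/* Re-executes the gather until the mask is all zeros; returns the
   number of invocations that were needed. */
static int sim_gather_all(int32_t zmm1[16], uint16_t k1,
                          const int32_t *base, const int32_t vindex[16])
{
    int iterations = 0;
    while (k1 != 0) {
        sim_vpgatherdd_step(zmm1, &k1, base, vindex);
        iterations++;
    }
    return iterations;
}
```

In this worst-case model a mask with six bits set needs six invocations. Real hardware may complete several elements per invocation, which is why the loop tests the mask rather than counting iterations.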


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.
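
The two boundary expressions above are plain address arithmetic and can be checked with a short C sketch. The function names line_first and line_last are hypothetical.

```c
#include <stdint.h>

/* First byte of the 64-byte region that contains addr. */
static uint64_t line_first(uint64_t addr)
{
    return addr & ~UINT64_C(0x3F);
}

/* Last byte of the 64-byte region that contains addr. */
static uint64_t line_last(uint64_t addr)
{
    return (addr & ~UINT64_C(0x3F)) + 63;
}
```

For an element at linear address 0x1047, the accessed region runs from 0x1040 to 0x107F.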


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 


The instruction will #GP fault if the destination vector zmm1 is the same as index vector 
VINDEX. 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+31:i] = UpConvLoadi32(pointer)
        k1[n] = 0
    }
}

Flags Affected 


None. 


Memory Up-conversion: Ui32


S2S1S0 | Function:         | Usage           | disp8*N
000    | no conversion     | [rax]           | 4
001    | reserved          | N/A             | N/A
010    | reserved          | N/A             | N/A
011    | reserved          | N/A             | N/A
100    | uint8 to uint32   | [rax] {uint8}   | 1
101    | sint8 to sint32   | [rax] {sint8}   | 1
110    | uint16 to uint32  | [rax] {uint16}  | 2
111    | sint16 to sint32  | [rax] {sint16}  | 2


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_i32gather_epi32 (__m512i, void const*, int);
__m512i _mm512_mask_i32gather_epi32 (__m512i, __mmask16, __m512i, void const*,
int);
__m512i _mm512_i32extgather_epi32 (__m512i, void const*, _MM_UPCONV_EPI32_ENUM,
int, int);
__m512i _mm512_mask_i32extgather_epi32 (__m512i, __mmask16, __m512i, void const*,
_MM_UPCONV_EPI32_ENUM, int, int);


Exceptions


Real-Address Mode and Virtual-8086
#UD              Instruction not available in these modes

Protected and Compatibility Mode
#UD              Instruction not available in these modes

64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the UpConv
                 and corresponding write-mask bit is not zero.
                 If the destination vector is the same as the index vector [see
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.


VPGATHERDQ - Gather Int64 Vector With Signed Dword Indices


Opcode                            Instruction                       Description
MVEX.512.66.0F38.W1 90 /r /vsib   vpgatherdq zmm1 {k1}, Ui64(mvt)   Gather int64 vector Ui64(mvt) into int64
                                                                    vector zmm1 using doubleword indices and k1
                                                                    as completion mask.


Description 


A set of 8 memory locations pointed to by base address BASE_ADDR and doubleword
index vector VINDEX with scale SCALE are converted to an int64 vector. The result is
written into int64 vector zmm1.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always enforce the execution of a gather/scatter instruction to be 
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the 
gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are 
zero). 


Note that each accessed element will always access 64 bytes of memory. The memory region
accessed by each element will always be between the element_linear_address & (~0x3F) and
(element_linear_address & (~0x3F)) + 63 boundaries.


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully loaded. 


The instruction will #GP fault if the destination vector zmm1 is the same as index vector 
VINDEX. 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mvt[n]
        zmm1[i+63:i] = UpConvLoadi64(pointer)
        k1[n] = 0
    }
}
k1[15:8] = 0


Flags Affected 


None. 


Memory Up-conversion: U_i64


S2S1S0   Function:        Usage    disp8*N
000      no conversion    [rax]    8
001      reserved         N/A      N/A
010      reserved         N/A      N/A
011      reserved         N/A      N/A
100      reserved         N/A      N/A
101      reserved         N/A      N/A
110      reserved         N/A      N/A
111      reserved         N/A      N/A


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_i32logather_epi64 (__m512i, void const*, int);

__m512i _mm512_mask_i32logather_epi64 (__m512i, __mmask8, __m512i, void const*,
int);

__m512i _mm512_i32loextgather_epi64 (__m512i, void const*,
_MM_UPCONV_EPI64_ENUM, int, int);

__m512i _mm512_mask_i32loextgather_epi64 (__m512i, __mmask8, __m512i, void const*,
_MM_UPCONV_EPI64_ENUM, int, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the UpConv
                 and corresponding write-mask bit is not zero.
                 If the destination vector is the same as the index vector [see
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.


VPMADD231D - Multiply First Source By Second Source and Add To Destination Int32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 B5 /r   vpmadd231d zmm1 {k1}, zmm2,   Multiply int32 vector zmm2 and int32 vector
                                S_i32(zmm3/m_t)               S_i32(zmm3/m_t), add the result to int32 vector
                                                              zmm1, and store the final result in zmm1,
                                                              under write-mask.


Description 


Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32
vector zmm3, then adds the result to int32 vector zmm1. The final sum is written into
int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
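
The masked multiply-add described above can be sketched as a scalar C model. The function name vpmadd231d_model is hypothetical, src3 stands for the third source after any swizzle/broadcast/conversion has already been applied, and the sketch assumes that each element keeps only the low 32 bits of the sum (the manual only says "integer operation").

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model: zmm1 = zmm2 * src3 + zmm1, element by element, under
 * write-mask k1. Elements with a clear mask bit are left untouched. */
void vpmadd231d_model(int32_t zmm1[16], uint16_t k1,
                      const int32_t zmm2[16], const int32_t src3[16]) {
    assert(zmm1 && zmm2 && src3);
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            /* Assumption: the result wraps modulo 2^32. Unsigned
             * arithmetic gives that without signed-overflow UB. */
            uint32_t r = (uint32_t)zmm2[n] * (uint32_t)src3[n]
                       + (uint32_t)zmm1[n];
            memcpy(&zmm1[n], &r, sizeof r);
        }
    }
}
```

Note that zmm1 is both an input (the addend) and the output, which is what the "231" operand ordering in the mnemonic refers to.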


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i] + zmm1[i+31:i]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_fmadd_epi32 (__m512i, __m512i, __m512i);
__m512i _mm512_mask_fmadd_epi32 (__m512i, __mmask16, __m512i, __m512i);
__m512i _mm512_mask3_fmadd_epi32 (__m512i, __m512i, __m512i, __mmask16);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMADD233D - Multiply First Source By Specially Swizzled Second Source
and Add To Second Source Int32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 B4 /r   vpmadd233d zmm1 {k1}, zmm2,   Multiply int32 vector zmm2 by certain elements
                                S_i32(zmm3/m_t)               of int32 vector S_i32(zmm3/m_t), add the result
                                                              to certain elements of S_i32(zmm3/m_t), and
                                                              store the final result in zmm1, under
                                                              write-mask.


Description 


This instruction is built around the concept of 4-element sets, of which there are four: 
elements 0-3, 4-7, 8-11, and 12-15. If we refer to the int32 vector result of the broadcast 
(no conversion is supported) process on memory or the int32 vector zmm3 (no swizzle 
is supported) as t3, then: 


Each element 0-3 of int32 vector zmm2 is multiplied by element 1 of t3, the result is added 
to element 0 of t3, and the final sum is written into the corresponding element 0-3 of int32 
vector zmm1. 


Each element 4-7 of int32 vector zmm2 is multiplied by element 5 of t3, the result is added 
to element 4 of t3, and the final sum is written into the corresponding element 4-7 of int32 
vector zmm1. 


Each element 8-11 of int32 vector zmm2 is multiplied by element 9 of t3, the result is 
added to element 8 of t3, and the final sum is written into the corresponding element 
8-11 of int32 vector zmm1. 


Each element 12-15 of int32 vector zmm2 is multiplied by element 13 of t3, the result is 
added to element 12 of t3, and the final sum is written into the corresponding element 
12-15 of int32 vector zmm1. 


This instruction makes it possible to perform scale and bias in a single instruction without 
needing to have either scale or bias already loaded in a register. This saves one vector load 
for each interpolant, representing around ten percent of shader instructions. 


For structure-of-arrays (SOA) operation, this instruction is intended to be used with the
{4to16} broadcast on src2, allowing all 16 scales and biases to be identical. For
array-of-structures (AOS) vec4 operations, no broadcast is used, allowing four different
scales and biases, one for each vec4.


No conversion or swizzling is supported for this instruction. However, all broadcasts
except {1to16} are supported (i.e. 16to16 and 4to16).


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
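
The scale-and-bias behaviour of the four 4-element sets can be sketched in scalar C as follows. The function name vpmadd233d_model is hypothetical, t3 is the (optionally {4to16}-broadcast) third source as defined above, and wrap-around to the low 32 bits is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model: within each 4-element set, element 1 of t3 is the
 * scale and element 0 of t3 is the bias:
 *     zmm1[n] = zmm2[n] * t3[set + 1] + t3[set]                     */
void vpmadd233d_model(int32_t zmm1[16], uint16_t k1,
                      const int32_t zmm2[16], const int32_t t3[16]) {
    assert(zmm1 && zmm2 && t3);
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            int set = n & ~3;  /* first element of the 4-element set */
            uint32_t scale = (uint32_t)t3[set + 1];
            uint32_t bias  = (uint32_t)t3[set];
            /* Assumption: result wraps modulo 2^32. */
            uint32_t r = (uint32_t)zmm2[n] * scale + bias;
            memcpy(&zmm1[n], &r, sizeof r);
        }
    }
}
```

With a {4to16} broadcast all four sets see the same scale/bias pair (the SOA case); without a broadcast each set has its own pair (the AOS vec4 case). Elements 2 and 3 of each set in t3 are not used by the computation.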


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        base = ( n & ~0x03 ) * 32
        scale[31:0] = tmpSrc3[base+63:base+32]
        bias[31:0] = tmpSrc3[base+31:base]
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * scale[31:0] + bias[31:0]
    }
}


Flags Affected 


None. 


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      reserved                    N/A                       N/A
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      reserved                    N/A                       N/A
101      reserved                    N/A                       N/A
110      reserved                    N/A                       N/A
111      reserved                    N/A                       N/A


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      reserved                 N/A
010      reserved                 N/A
011      reserved                 N/A
100      reserved                 N/A
101      reserved                 N/A
110      reserved                 N/A
111      reserved                 N/A


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_fmadd233_epi32 (__m512i, __m512i);
__m512i _mm512_mask_fmadd233_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to 16 or 64-byte (depending on the swizzle broadcast).
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv involving data conversion, register swizzling or
                 {1to16} broadcast. If SwizzUpConv function is set to any
                 value different than "no action" or {4to16} then
                 an Invalid Opcode fault is raised.


VPMAXSD - Maximum of Int32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 3D /r   vpmaxsd zmm1 {k1}, zmm2,      Determine the maximum of int32 vector zmm2
                                S_i32(zmm3/m_t)               and int32 vector S_i32(zmm3/m_t) and store the
                                                              result in zmm1, under write-mask.


Description 


Determines the maximum value of each pair of corresponding elements in int32 vector
zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory
or int32 vector zmm3. The result is written into int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = IMax(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}
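
The register swizzle table can be read as a per-lane permutation. The scalar C sketch below is an illustration under a stated assumption: the letters in a pattern such as {cdab} are taken to be listed from the most significant element of each 4-element lane to the least significant, with a naming element 0 of the lane. The names swizzle_src and swizzle_i32 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* For each selector S2S1S0, the source position (0 = a ... 3 = d) that
 * feeds destination positions 0..3 of every 4-element lane. */
static const int swizzle_src[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba} no swizzle            */
    {1, 0, 3, 2},  /* 001 {cdab} swap (inner) pairs    */
    {2, 3, 0, 1},  /* 010 {badc} swap with two-away    */
    {1, 2, 0, 3},  /* 011 {dacb} cross-product swizzle */
    {0, 0, 0, 0},  /* 100 {aaaa} broadcast a element   */
    {1, 1, 1, 1},  /* 101 {bbbb} broadcast b element   */
    {2, 2, 2, 2},  /* 110 {cccc} broadcast c element   */
    {3, 3, 3, 3},  /* 111 {dddd} broadcast d element   */
};

/* Apply the same swizzle to each of the four 4-element lanes.
 * dst and src must not be the same array. */
void swizzle_i32(int32_t dst[16], const int32_t src[16], unsigned sel) {
    assert(sel < 8);
    assert(dst != src);
    for (int n = 0; n < 16; n++) {
        int lane = n & ~3;  /* first element of this lane */
        dst[n] = src[lane + swizzle_src[sel][n & 3]];
    }
}
```

The swizzle never moves data across lanes: every destination element is taken from the same group of four source elements.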


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_max_epi32 (__m512i, __m512i);
__m512i _mm512_mask_max_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMAXUD - Maximum of Uint32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 3F /r   vpmaxud zmm1 {k1}, zmm2,      Determine the maximum of uint32 vector
                                S_i32(zmm3/m_t)               zmm2 and uint32 vector S_i32(zmm3/m_t) and
                                                              store the result in zmm1, under write-mask.


Description 


Determines the maximum value of each pair of corresponding elements in uint32 vector
zmm2 and the uint32 vector result of the swizzle/broadcast/conversion process on
memory or uint32 vector zmm3. The result is written into uint32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
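
The only difference between the signed (VPMAXSD) and unsigned (VPMAXUD) forms is how the same 32-bit patterns are compared. The scalar C sketch below illustrates this; the names vpmaxsd_model and vpmaxud_model are hypothetical, and src3 stands for the third source after swizzle/broadcast/conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the signed maximum under write-mask k1. */
void vpmaxsd_model(int32_t zmm1[16], uint16_t k1,
                   const int32_t zmm2[16], const int32_t src3[16]) {
    assert(zmm1 && zmm2 && src3);
    for (int n = 0; n < 16; n++)
        if (k1 & (1u << n))
            zmm1[n] = zmm2[n] > src3[n] ? zmm2[n] : src3[n];
}

/* Scalar model of the unsigned maximum under write-mask k1. */
void vpmaxud_model(uint32_t zmm1[16], uint16_t k1,
                   const uint32_t zmm2[16], const uint32_t src3[16]) {
    assert(zmm1 && zmm2 && src3);
    for (int n = 0; n < 16; n++)
        if (k1 & (1u << n))
            zmm1[n] = zmm2[n] > src3[n] ? zmm2[n] : src3[n];
}
```

For example, the pattern 0xFFFFFFFF compared against 1 loses under the signed model (it is -1) but wins under the unsigned model (it is 4294967295).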


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        zmm1[i+31:i] = UMax(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_max_epu32 (__m512i, __m512i);
__m512i _mm512_mask_max_epu32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMINSD - Minimum of Int32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 39 /r   vpminsd zmm1 {k1}, zmm2,      Determine the minimum of int32 vector zmm2
                                S_i32(zmm3/m_t)               and int32 vector S_i32(zmm3/m_t) and store the
                                                              result in zmm1, under write-mask.


Description 


Determines the minimum value of each pair of corresponding elements in int32 vector
zmm2 and the int32 vector result of the swizzle/broadcast/conversion process on memory
or int32 vector zmm3. The result is written into int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
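
As an illustration of how the masked minimum combines with a memory broadcast, the scalar C sketch below applies an upper bound to the enabled elements of a vector, with the bound supplied as a single int32 in the manner of a {1to16} operand. All function names are hypothetical and the code is a model of the semantics described above, not of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Models the {1to16} broadcast: one int32 replicated to 16 elements. */
void broadcast_1to16(int32_t dst[16], const int32_t *mem) {
    assert(dst && mem);
    for (int n = 0; n < 16; n++)
        dst[n] = *mem;
}

/* Scalar model of the signed minimum under write-mask k1. */
void vpminsd_model(int32_t zmm1[16], uint16_t k1,
                   const int32_t zmm2[16], const int32_t src3[16]) {
    for (int n = 0; n < 16; n++)
        if (k1 & (1u << n))
            zmm1[n] = zmm2[n] < src3[n] ? zmm2[n] : src3[n];
}

/* v = min(v, broadcast(*hi)) for the elements enabled in k1. */
void clamp_upper(int32_t v[16], uint16_t k1, const int32_t *hi) {
    int32_t limit[16];
    broadcast_1to16(limit, hi);
    vpminsd_model(v, k1, v, limit);  /* destination doubles as source */
}
```

Elements whose mask bit is clear keep their previous value even if it exceeds the bound.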


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = (zmm2[i+31:i] < tmpSrc3[i+31:i]) ?
                       zmm2[i+31:i] : tmpSrc3[i+31:i]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32
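
The rows of the memory up-conversion table can be sketched as scalar C helpers. This is a simplified illustration with hypothetical function names: it covers one unsigned conversion, one signed conversion, the {4to16} broadcast, and the disp8*N rule, under which the encoded 8-bit displacement is multiplied by the N listed for the selected row.

```c
#include <assert.h>
#include <stdint.h>

/* S2S1S0 = 100 {uint8}: 16 bytes in memory, each zero-extended. */
void upconv_uint8(uint32_t dst[16], const uint8_t mem[16]) {
    for (int n = 0; n < 16; n++)
        dst[n] = mem[n];
}

/* S2S1S0 = 111 {sint16}: 16 words in memory, each sign-extended. */
void upconv_sint16(int32_t dst[16], const int16_t mem[16]) {
    for (int n = 0; n < 16; n++)
        dst[n] = mem[n];
}

/* S2S1S0 = 010 {4to16}: 4 elements in memory, replicated 4 times. */
void broadcast_4to16(int32_t dst[16], const int32_t mem[4]) {
    for (int n = 0; n < 16; n++)
        dst[n] = mem[n & 3];
}

/* disp8*N: byte offset encoded by an 8-bit signed displacement. */
int32_t disp8_to_bytes(int8_t disp8, int n_bytes) {
    assert(n_bytes > 0);
    return (int32_t)disp8 * n_bytes;
}
```

In every row N equals the number of bytes actually read from memory: 16 x 1 byte for {uint8}, 16 x 2 bytes for {uint16}, 4 bytes for {1to16}, and so on.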


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_min_epi32 (__m512i, __m512i);
__m512i _mm512_mask_min_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMINUD - Minimum of Uint32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 3B /r   vpminud zmm1 {k1}, zmm2,      Determine the minimum of uint32 vector
                                S_i32(zmm3/m_t)               zmm2 and uint32 vector S_i32(zmm3/m_t) and
                                                              store the result in zmm1, under write-mask.


Description 


Determines the minimum value of each pair of corresponding elements in uint32 vector
zmm2 and the uint32 vector result of the swizzle/broadcast/conversion process on memory
or uint32 vector zmm3. The result is written into uint32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        zmm1[i+31:i] = UMin(zmm2[i+31:i] , tmpSrc3[i+31:i])
    }
}


Flags Affected 


None. 




Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_min_epu32 (__m512i, __m512i);
__m512i _mm512_mask_min_epu32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMULHD - Multiply Int32 Vectors And Store High Result


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 87 /r   vpmulhd zmm1 {k1}, zmm2,      Multiply int32 vector zmm2 and int32 vector
                                S_i32(zmm3/m_t)               S_i32(zmm3/m_t) and store the result in zmm1,
                                                              under write-mask.


Description 


Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32
vector zmm3. The high 32 bits of the result are written into int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        tmp[63:0] = zmm2[i+31:i] * tmpSrc3[i+31:i]
        zmm1[i+31:i] = tmp[63:32]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_mulhi_epi32 (__m512i, __m512i);
__m512i _mm512_mask_mulhi_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMULHUD - Multiply Uint32 Vectors And Store High Result


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 86 /r   vpmulhud zmm1 {k1}, zmm2,     Multiply uint32 vector zmm2 and uint32 vector
                                S_i32(zmm3/m_t)               S_i32(zmm3/m_t) and store the result in zmm1,
                                                              under write-mask.


Description 


Performs an element-by-element multiplication between uint32 vector zmm2 and the
uint32 vector result of the swizzle/broadcast/conversion process on memory or uint32
vector zmm3. The high 32 bits of the result are written into uint32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
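
The high-half multiply can be sketched in scalar C by forming the full 64-bit product and keeping bits 63:32. The sketch below contrasts the unsigned form with the signed form (VPMULHD) on a per-element basis; the function names are hypothetical, and the signed helper returns the raw 32-bit pattern so that no implementation-defined shift or conversion is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned element: high 32 bits of the 64-bit unsigned product. */
uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Signed element: high 32 bits of the 64-bit signed product,
 * returned as a bit pattern. */
uint32_t mulhi_i32_bits(int32_t a, int32_t b) {
    int64_t p = (int64_t)a * b;
    return (uint32_t)((uint64_t)p >> 32);
}

/* Scalar model of the unsigned high multiply under write-mask k1. */
void vpmulhud_model(uint32_t zmm1[16], uint16_t k1,
                    const uint32_t zmm2[16], const uint32_t src3[16]) {
    assert(zmm1 && zmm2 && src3);
    for (int n = 0; n < 16; n++)
        if (k1 & (1u << n))
            zmm1[n] = mulhi_u32(zmm2[n], src3[n]);
}
```

The same input bits generally give different high halves in the two forms: 0xFFFFFFFF times 0xFFFFFFFF yields 0xFFFFFFFE when treated as unsigned, but 0 when both operands are treated as -1.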


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // unsigned integer operation
        tmp[63:0] = zmm2[i+31:i] * tmpSrc3[i+31:i]
        zmm1[i+31:i] = tmp[63:32]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_mulhi_epu32 (__m512i, __m512i);
__m512i _mm512_mask_mulhi_epu32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPMULLD - Multiply Int32 Vectors And Store Low Result


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F38.W0 40 /r   vpmulld zmm1 {k1}, zmm2,      Multiply int32 vector zmm2 and int32 vector
                                S_i32(zmm3/m_t)               S_i32(zmm3/m_t) and store the result in zmm1,
                                                              under write-mask.


Description 


Performs an element-by-element multiplication between int32 vector zmm2 and the
int32 vector result of the swizzle/broadcast/conversion process on memory or int32
vector zmm3, and the low 32 bits of the result are written into int32 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
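
The low-half multiply can be sketched in scalar C as follows. The function name vpmulld_model is hypothetical and src3 stands for the third source after swizzle/broadcast/conversion. The product is formed in unsigned arithmetic so that truncation to 32 bits is well defined; the low 32 bits of a product are the same whether the operands are read as signed or unsigned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model: zmm1 = low 32 bits of (zmm2 * src3), element by
 * element, under write-mask k1. */
void vpmulld_model(int32_t zmm1[16], uint16_t k1,
                   const int32_t zmm2[16], const int32_t src3[16]) {
    assert(zmm1 && zmm2 && src3);
    for (int n = 0; n < 16; n++) {
        if (k1 & (1u << n)) {
            uint64_t p = (uint64_t)(uint32_t)zmm2[n] * (uint32_t)src3[n];
            uint32_t lo = (uint32_t)p;        /* keep bits 31:0 only */
            memcpy(&zmm1[n], &lo, sizeof lo); /* reinterpret as int32 */
        }
    }
}
```

Overflow is silent: 65536 times 65536 produces 0 in the destination element, because the only set bit of the product is bit 32.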


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] * tmpSrc3[i+31:i]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_mullo_epi32 (__m512i, __m512i);
__m512i _mm512_mask_mullo_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086

#UD              Instruction not available in these modes

Protected and Compatibility Mode

#UD              Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPORD - Bitwise OR Int32 Vectors


Opcode                          Instruction                   Description
MVEX.NDS.512.66.0F.W0 EB /r     vpord zmm1 {k1}, zmm2,        Perform a bitwise OR between int32 vector
                                S_i32(zmm3/m_t)               zmm2 and int32 vector S_i32(zmm3/m_t) and
                                                              store the result in zmm1, under write-mask.


Description 


Performs an element-by-element bitwise OR between int32 vector zmm2 and the int32 
vector result of the swizzle/broadcast/conversion process on memory or int32 vector 
zmm3. The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = zmm2[i+31:i] | tmpSrc3[i+31:i]
    }
}


Flags Affected


None.


Memory Up-conversion: S_i32


S2S1S0   Function:                   Usage                     disp8*N
000      no conversion               [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)   [rax] {1to16}             4
010      broadcast 4 elements (x4)   [rax] {4to16}             16
011      reserved                    N/A                       N/A
100      uint8 to uint32             [rax] {uint8}             16
101      sint8 to sint32             [rax] {sint8}             16
110      uint16 to uint32            [rax] {uint16}            32
111      sint16 to sint32            [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits    Usage
000      no swizzle               zmm0 or zmm0 {dcba}
001      swap (inner) pairs       zmm0 {cdab}
010      swap with two-away       zmm0 {badc}
011      cross-product swizzle    zmm0 {dacb}
100      broadcast a element      zmm0 {aaaa}
101      broadcast b element      zmm0 {bbbb}
110      broadcast c element      zmm0 {cccc}
111      broadcast d element      zmm0 {dddd}


Intel’ C/C++ Compiler Intrinsic Equivalent 


_—m512i _mm512_or_epi32 (_m512i,__m512i); 
—m512i _mm512_mask_or_epi32 (_m512i,_mmask16,__m512i,__m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.


VPORQ - Bitwise OR Int64 Vectors 
Opcode                    Instruction                Description
MVEX.NDS.512.66.0F.W1     vporq zmm1 {k1}, zmm2,     Perform a bitwise OR between int64 vector
EB /r                     S_i64(zmm3/m_t)            zmm2 and int64 vector S_i64(zmm3/m_t) and
                                                     store the result in zmm1, under write-mask.


Description 


Performs an element-by-element bitwise OR between int64 vector zmm2 and the int64
vector result of the swizzle/broadcast/conversion process on memory or int64 vector
zmm3. The result is written into int64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i64(zmm3/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = zmm2[i+63:i] | tmpSrc3[i+63:i]
    }
}


Flags Affected 


None. 
Memory Up-conversion: S_i64

S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {8to8} or [rax]     64
001      broadcast 1 element (x8)     [rax] {1to8}              8
010      broadcast 4 elements (x2)    [rax] {4to8}              32
011      reserved                     N/A                       N/A
100      reserved                     N/A                       N/A
101      reserved                     N/A                       N/A
110      reserved                     N/A                       N/A
111      reserved                     N/A                       N/A


Register Swizzle: S_i64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_or_epi64 (__m512i, __m512i);
__m512i _mm512_mask_or_epi64 (__m512i, __mmask8, __m512i, __m512i);


Exceptions


Real-Address Mode and Virtual-8086


#UD              Instruction not available in these modes


Protected and Compatibility Mode


#UD              Instruction not available in these modes


64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSBBD - Subtract Int32 Vectors with Borrow


Opcode                     Instruction                Description
MVEX.NDS.512.66.0F38.W0    vpsbbd zmm1 {k1}, k2,      Subtract int32 vector S_i32(zmm3/m_t) and vector
5E /r                      S_i32(zmm3/m_t)            mask register k2 from int32 vector zmm1
                                                      and store the result in zmm1, and the borrow
                                                      of the subtraction in k2, under write-mask.


Description 


Performs an element-by-element three-input subtraction of the int32 vector result of the 
swizzle/broadcast/conversion process on memory or int32 vector zmm3, as well as the 
corresponding bit of k2, from int32 vector zmm1. The result is written into int32 vector 
zmm1. 


In addition, the borrow from the subtraction difference for the n-th element is written 
into the n-th bit of vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 
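
As a rough illustration of the three-input subtraction and the borrow written back to k2, the following C sketch models the instruction on plain arrays. The helper name masked_sbb_epi32 is hypothetical, the third source is assumed to be already swizzled/converted, and unsigned 64-bit arithmetic is used here purely as one convenient way to detect the borrow.

```c
#include <stdint.h>

/* Hypothetical scalar model of a write-masked subtract with borrow.
 * v1 plays the role of zmm1 (source and destination), v3 the converted
 * third source, k1 the write mask, *k2 the borrow-in/borrow-out mask. */
static void masked_sbb_epi32(uint32_t v1[16], uint16_t k1,
                             uint16_t *k2, const uint32_t v3[16])
{
    uint16_t k = *k2;
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            uint64_t borrow_in = (k >> n) & 1u;
            /* Widen to 64 bits so a negative result is visible in bit 63. */
            uint64_t diff = (uint64_t)v1[n] - borrow_in - (uint64_t)v3[n];
            v1[n] = (uint32_t)diff;
            if ((diff >> 63) & 1u) {
                k |= (uint16_t)(1u << n);    /* borrow out */
            } else {
                k &= (uint16_t)~(1u << n);   /* no borrow  */
            }
        }
        /* Disabled elements keep both v1[n] and bit n of k2. */
    }
    *k2 = k;
}
```

Feeding the borrow mask of one call into the next is one way to build multi-word subtraction, with each element position carrying its own borrow chain.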


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        tmpBorrow = Borrow(zmm1[i+31:i] - k2[n] - tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] - k2[n] - tmpSrc3[i+31:i]
        k2[n] = tmpBorrow
    }
}


Flags Affected


None. 


Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_sbb_epi32 (__m512i, __mmask16, __m512i, __mmask16*);
__m512i _mm512_mask_sbb_epi32 (__m512i, __mmask16, __mmask16, __m512i,
                               __mmask16*);


Exceptions


Real-Address Mode and Virtual-8086


#UD              Instruction not available in these modes

Protected and Compatibility Mode


#UD              Instruction not available in these modes


64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSBBRD - Reverse Subtract Int32 Vectors with Borrow


Opcode                     Instruction                Description
MVEX.NDS.512.66.0F38.W0    vpsbbrd zmm1 {k1}, k2,     Subtract int32 vector zmm1 and vector mask
6E /r                      S_i32(zmm3/m_t)            register k2 from int32 vector S_i32(zmm3/m_t),
                                                      and store the result in zmm1, and the borrow
                                                      of the subtraction in k2, under write-mask.


Description 


Performs an element-by-element three-input subtraction of int32 vector zmm1, as well as 
the corresponding bit of k2, from the int32 vector result of the swizzle/broadcast/conversion 
process on memory or int32 vector zmm3. The result is written into int32 vector zmm1. 


In addition, the borrow from the subtraction for the n-th element is written into the n-th 
bit of vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        tmpBorrow = Borrow(tmpSrc3[i+31:i] - k2[n] - zmm1[i+31:i])
        zmm1[i+31:i] = tmpSrc3[i+31:i] - k2[n] - zmm1[i+31:i]
        k2[n] = tmpBorrow
    }
}


Flags Affected 


None. 
Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_sbbr_epi32 (__m512i, __mmask16, __m512i, __mmask16*);
__m512i _mm512_mask_sbbr_epi32 (__m512i, __mmask16, __mmask16, __m512i,
                                __mmask16*);


Exceptions 


Real-Address Mode and Virtual-8086 
#UD Instruction not available in these modes 
Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.
                 If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSCATTERDD - Scatter Int32 Vector With Signed Dword Indices


Opcode                    Instruction                Description
MVEX.512.66.0F38.W0 A0    vpscatterdd mv_t {k1},     Scatter int32 vector D_i32(zmm1) to vector
/r /vsib                  D_i32(zmm1)                memory locations mv_t using doubleword
                                                     indices and k1 as completion mask.


Description 


Down-converts and stores all 16 elements in int32 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX, with
scale SCALE.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction in a loop until the
sequence is complete (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence all the write-mask bits are zero).


Writes to overlapping destination memory locations are guaranteed to be ordered with 
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping 
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB 
of the source registers). Writes that are not overlapped may happen in any order. Memory 
ordering with other instructions follows the Intel-64 memory ordering model. Note that 
this does not account for non-overlapping indices that map into the same physical address 
locations. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 
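
The completion-mask protocol described above can be sketched in portable C. This is a model under stated assumptions, not the hardware behavior: SELECT_SUBSET is modelled as picking only the least significant enabled mask bit (the minimum the text guarantees), no down-conversion is applied, and the names scatter_step_i32 and scatter_i32 are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* One modelled invocation of the scatter: stores the element for the
 * least significant enabled mask bit and returns the mask with that
 * bit cleared. Real hardware may complete more elements per call. */
static uint16_t scatter_step_i32(void *base, uint16_t k1,
                                 const int32_t vindex[16],
                                 const int32_t v1[16], int scale)
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            /* Address = base + sign-extended index * scale. */
            char *p = (char *)base + (int64_t)vindex[n] * scale;
            memcpy(p, &v1[n], sizeof v1[n]);
            return (uint16_t)(k1 & ~(1u << n));
        }
    }
    return k1;
}

/* Re-trigger the scatter until every mask bit has been cleared.
 * Returns the number of invocations that were needed. */
static int scatter_i32(void *base, uint16_t k1,
                       const int32_t vindex[16],
                       const int32_t v1[16], int scale)
{
    int invocations = 0;
    while (k1 != 0) {
        k1 = scatter_step_i32(base, k1, vindex, v1, scale);
        invocations++;
    }
    return invocations;
}
```

Because elements are processed from the least significant mask bit upward, two elements that target the same address leave the value of the higher-numbered element in memory, which matches the LSB-to-MSB ordering stated for overlapping writes.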


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)


// Use mv_t as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mv_t[n]
        tmp = DownConvStore_i32(zmm1[i+31:i], SSS[2:0])
        if (DownConvStoreSizeOf_i32(SSS[2:0]) == 4) {
            MemStore(pointer) = tmp[31:0]
        } else if (DownConvStoreSizeOf_i32(SSS[2:0]) == 2) {
            MemStore(pointer) = tmp[15:0]
        } else if (DownConvStoreSizeOf_i32(SSS[2:0]) == 1) {
            MemStore(pointer) = tmp[7:0]
        }
        k1[n] = 0
    }
}

Flags Affected 


None. 


Memory Down-conversion: D_i32


S2S1S0   Function:            Usage             disp8*N
000      no conversion        zmm1              4
001      reserved             N/A               N/A
010      reserved             N/A               N/A
011      reserved             N/A               N/A
100      uint32 to uint8      zmm1 {uint8}      1
101      sint32 to sint8      zmm1 {sint8}      1
110      uint32 to uint16     zmm1 {uint16}     2
111      sint32 to sint16     zmm1 {sint16}     2


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_i32scatter_epi32 (void*, __m512i, __m512i, int);
void _mm512_mask_i32scatter_epi32 (void*, __mmask16, __m512i, __m512i, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 
Protected and Compatibility Mode


#UD              Instruction not available in these modes


64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the DownConv
                 mode, and corresponding write-mask bit is not zero.
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.
VPSCATTERDQ - Scatter Int64 Vector With Signed Dword Indices


Opcode                    Instruction                Description
MVEX.512.66.0F38.W1 A0    vpscatterdq mv_t {k1},     Scatter int64 vector D_i64(zmm1) to vector
/r /vsib                  D_i64(zmm1)                memory locations mv_t using doubleword
                                                     indices and k1 as completion mask.


Description 


Down-converts and stores all 8 elements in int64 vector zmm1 to the memory locations
pointed to by base address BASE_ADDR and doubleword index vector VINDEX, with
scale SCALE.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction in a loop until the
sequence is complete (i.e. all elements of the gather/scatter sequence have been
loaded/stored and hence all the write-mask bits are zero).


Writes to overlapping destination memory locations are guaranteed to be ordered with 
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping 
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB 
of the source registers). Writes that are not overlapped may happen in any order. Memory 
ordering with other instructions follows the Intel-64 memory ordering model. Note that 
this does not account for non-overlapping indices that map into the same physical address 
locations. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)


// Use mv_t as vector memory operand (VSIB)
for (n = 0; n < 8; n++) {
    if (ktemp[n] != 0) {
        i = 64*n
        j = 32*n
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE)
        pointer[63:0] = mv_t[n]
        tmp = DownConvStore_i64(zmm1[i+63:i], SSS[2:0])
        if (DownConvStoreSizeOf_i64(SSS[2:0]) == 8) {
            MemStore(pointer) = tmp[63:0]
        }
        k1[n] = 0
    }
}
k1[15:8] = 0


Flags Affected 


None. 


Memory Down-conversion: D_i64


S2S1S0   Function:            Usage             disp8*N
000      no conversion        zmm1              8
001      reserved             N/A               N/A
010      reserved             N/A               N/A
011      reserved             N/A               N/A
100      reserved             N/A               N/A
101      reserved             N/A               N/A
110      reserved             N/A               N/A
111      reserved             N/A               N/A


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_i32loscatter_epi64 (void*, __m512i, __m512i, int);
void _mm512_mask_i32loscatter_epi64 (void*, __mmask8, __m512i, __m512i, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 
Protected and Compatibility Mode


#UD              Instruction not available in these modes


64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form, and corresponding write-mask bit is not zero.
#GP(0)           If a memory address is in a non-canonical form,
                 and corresponding write-mask bit is not zero.
                 If a memory operand linear address is not aligned
                 to element-wise data granularity dictated by the DownConv
                 mode, and corresponding write-mask bit is not zero.
#PF(fault-code)  If a memory operand linear address produces a page fault
                 and corresponding write-mask bit is not zero.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 If using a 16 bit effective address.
                 If ModRM.rm is different than 100b.
                 If no write mask is provided or selected write-mask is k0.
VPSHUFD - Shuffle Vector Doublewords


Opcode                       Instruction                          Description
MVEX.512.66.0F.W0 70 /r ib   vpshufd zmm1 {k1}, zmm2/m_t, imm8    Dword shuffle int32 vector
                                                                  zmm2/m_t and store the result
                                                                  in zmm1, using imm8, under
                                                                  write-mask.


Description 


Shuffles 32 bit blocks of the vector read from memory or vector zmm2/mem using index 
bits in immediate. The result of the shuffle is written into vector zmm1. 


No swizzle, broadcast, or conversion is performed by this instruction. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format


imm8    32 bit level permutation vector {dcba}    I7 I6 | I5 I4 | I3 I2 | I1 I0
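
To make the immediate encoding concrete, the following C sketch models the intra-lane shuffle on a plain array: each pair of immediate bits selects which of the four elements in the same 128-bit lane feeds the corresponding destination element. The helper name masked_shuffle_epi32 is hypothetical and the array model of the registers is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar model of the write-masked dword shuffle.
 * Bits [2e+1:2e] of imm8 select the source element (0..3) within the
 * same 4-element lane for destination element e of every lane. */
static void masked_shuffle_epi32(uint32_t dst[16], uint16_t k1,
                                 const uint32_t src[16], uint8_t imm8)
{
    uint32_t tmp[16];

    /* Compute into a temporary so dst may alias src. */
    for (int n = 0; n < 16; n++) {
        unsigned sel = (imm8 >> (2 * (n & 3))) & 3u;  /* offset in lane */
        tmp[n] = src[(n & ~3) + (int)sel];            /* lane base + sel */
    }
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = tmp[n];
        }
    }
}
```

For example, an immediate of 0x1B (binary 00 01 10 11) reverses the four elements inside every lane, while 0xE4 (binary 11 10 01 00) is the identity.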


Operation 


src[511:0] = zmm2/m_t


// Intra-lane shuffle
for (n = 0; n < 16; n++) {
    if (k1[n] != 0) {
        i = 32*n
        // offset within 128-bit chunk
        j = 32*((perm32 >> 2*(n & 0x3)) & 0x3)
        // 128-bit level offset
        j = j + 128*(n >> 2)
        zmm1[i+31:i] = src[j+31:j]
    }
}


Flags Affected


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_shuffle_epi32 (__m512i, _MM_PERM_ENUM);
__m512i _mm512_mask_shuffle_epi32 (__m512i, __mmask16, __m512i,
                                   _MM_PERM_ENUM);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.
VPSLLD - Shift Int32 Vector Immediate Left Logical


Opcode                    Instruction                Description
MVEX.NDD.512.66.0F.W0     vpslld zmm1 {k1},          Shift left int32 vector S_i32(zmm2/m_t) and store
72 /6 ib                  S_i32(zmm2/m_t), imm8      the result in zmm1, using imm8, under write-mask.

Description 


Performs an element-by-element logical left shift of the result of the swizzle/broadcast/conversion 
process on memory or vector int32 zmm2, shifting by the number of bits specified in im- 
mediate field. The result is stored in int32 vector zmm1. 


If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = tmpSrc2[i+31:i] << IMM8[7:0]
    }
}


Flags Affected 


None. 
Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_slli_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_slli_epi32 (__m512i, __mmask16, __m512i, unsigned int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSLLVD - Shift Int32 Vector Left Logical


Opcode                     Instruction                 Description
MVEX.NDS.512.66.0F38.W0    vpsllvd zmm1 {k1}, zmm2,    Shift left int32 vector zmm2 and int32 vector
47 /r                      S_i32(zmm3/m_t)             S_i32(zmm3/m_t) and store the result in zmm1,
                                                       under write-mask.


Description 


Performs an element-by-element left shift of int32 vector zmm2, shifting by the number 
of bits specified by the int32 vector result of the swizzle/broadcast/conversion process 
on memory or vector int32 zmm3. The result is stored in int32 vector zmm1. 


If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
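
The out-of-range rule matters when modelling this instruction in C, because a C shift by 32 or more bits on a 32-bit operand is undefined behavior and must be handled explicitly. The sketch below is an array-based model with a hypothetical helper name (masked_sllv_epi32); the per-element counts are assumed to be already swizzled/converted and are treated here as unsigned values.

```c
#include <stdint.h>

/* Hypothetical scalar model of a write-masked variable left shift.
 * Counts above 31 produce zero, as the description states. */
static void masked_sllv_epi32(uint32_t dst[16], uint16_t k1,
                              const uint32_t v2[16],
                              const uint32_t count[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            /* Guard the shift: C leaves shifts >= 32 undefined. */
            dst[n] = (count[n] > 31u) ? 0u : (v2[n] << count[n]);
        }
    }
}
```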


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] << tmpSrc3[i+31:i]
    }
}


Flags Affected 


None. 
Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_sllv_epi32 (__m512i, __m512i);
__m512i _mm512_mask_sllv_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSRAD - Shift Int32 Vector Immediate Right Arithmetic


Opcode                    Instruction                Description
MVEX.NDD.512.66.0F.W0     vpsrad zmm1 {k1},          Shift right arithmetic int32 vector
72 /4 ib                  S_i32(zmm2/m_t), imm8      S_i32(zmm2/m_t) and store the result in zmm1,
                                                     using imm8, under write-mask.


Description 


Performs an element-by-element arithmetic right shift of the result of the swizzle/broadcast/conversion 
process on memory or vector int32 zmm2, shifting by the number of bits specified in im- 
mediate field. The result is stored in int32 vector zmm1. 


An arithmetic right shift leaves the sign bit unchanged after each shift count, so the final
result has the i+1 msbs set to the original sign bit, where i is the number of bits by which
to shift right.


If the value specified by the shift operand is greater than 31 each destination data element 
is filled with the initial value of the sign bit of the element. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = tmpSrc2[i+31:i] >> IMM8[7:0]
    }
}


Flags Affected 


None. 




Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512i _mm512_srai_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_srai_epi32 (__m512i, __mmask16, __m512i, unsigned int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSRAVD - Shift Int32 Vector Right Arithmetic


Opcode                     Instruction                 Description
MVEX.NDS.512.66.0F38.W0    vpsravd zmm1 {k1}, zmm2,    Shift right arithmetic int32 vector zmm2 and
46 /r                      S_i32(zmm3/m_t)             int32 vector S_i32(zmm3/m_t) and store the
                                                       result in zmm1, under write-mask.


Description 


Performs an element-by-element arithmetic right shift of int32 vector zmm2, shifting by 
the number of bits specified by the int32 vector result of the swizzle/broadcast/conversion 
process on memory or vector int32 zmm3. The result is stored in int32 vector zmm1. 


An arithmetic right shift leaves the sign bit unchanged after each shift count, so the final
result has the i+1 msbs set to the original sign bit, where i is the number of bits by which
to shift right.


If the value specified by the shift operand is greater than 31 each destination data element 
is filled with the initial value of the sign bit of the element. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
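
The sign-fill rule for large counts can be captured in C without relying on how a particular compiler treats right shifts of negative values. The sketch below is an illustrative model only: the helper names sra32 and masked_srav_epi32 are hypothetical, the counts are treated as unsigned, and clamping the count to 31 is simply one way to obtain "every bit equals the sign bit".

```c
#include <stdint.h>

/* Arithmetic right shift of one int32 element.
 * Counts above 31 are clamped to 31, which fills the element with the
 * sign bit. For negative x, ~x is non-negative, so shifting it is well
 * defined; complementing again restores the shifted-in sign bits. */
static int32_t sra32(int32_t x, uint32_t count)
{
    if (count > 31u) {
        count = 31u;
    }
    return (x < 0) ? ~(~x >> count) : (x >> count);
}

/* Hypothetical scalar model of the write-masked variable shift. */
static void masked_srav_epi32(int32_t dst[16], uint16_t k1,
                              const int32_t v2[16],
                              const uint32_t count[16])
{
    for (int n = 0; n < 16; n++) {
        if ((k1 >> n) & 1u) {
            dst[n] = sra32(v2[n], count[n]);
        }
    }
}
```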


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] >> tmpSrc3[i+31:i]
    }
}


Flags Affected 


None. 
Memory Up-conversion: S_i32


S2S1S0   Function:                    Usage                     disp8*N
000      no conversion                [rax] {16to16} or [rax]   64
001      broadcast 1 element (x16)    [rax] {1to16}             4
010      broadcast 4 elements (x4)    [rax] {4to16}             16
011      reserved                     N/A                       N/A
100      uint8 to uint32              [rax] {uint8}             16
101      sint8 to sint32              [rax] {sint8}             16
110      uint16 to uint32             [rax] {uint16}            32
111      sint16 to sint32             [rax] {sint16}            32


Register Swizzle: S_i32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits     Usage
000      no swizzle                zmm0 or zmm0 {dcba}
001      swap (inner) pairs        zmm0 {cdab}
010      swap with two-away        zmm0 {badc}
011      cross-product swizzle     zmm0 {dacb}
100      broadcast a element       zmm0 {aaaa}
101      broadcast b element       zmm0 {bbbb}
110      broadcast c element       zmm0 {cccc}
111      broadcast d element       zmm0 {dddd}


Intel’ C/C++ Compiler Intrinsic Equivalent 


_—m512i _mm512_srav_epi32 (_m512i,__m512i); 
_—m512i _mm512_mask_srav_epi32 (_m512i,_mmask16,_m512i, 
Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If preceded by any REX, F0, F2, F3, or 66 prefixes.
VPSRLD - Shift Int32 Vector Immediate Right Logical

Opcode                   Instruction              Description
MVEX.NDD.512.66.0F.W0    vpsrld zmm1 {k1},        Shift right logical int32 vector S_i32(zmm2/m_t)
72 /2 ib                 S_i32(zmm2/m_t), imm8    and store the result in zmm1, using imm8, under
                                                  write-mask.


Description 


Performs an element-by-element logical right shift of the result of the swizzle/broadcast/conversion
process on memory or vector int32 zmm2, shifting by the number of bits specified in the
immediate field. The result is stored in int32 vector zmm1.

A logical right shift shifts a 0-bit into the msb for each shift count, so the final result has
the n msbs set to 0, where n is the number of bits by which to shift right.

If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
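
As a hedged illustration of the rules above, the following scalar C model applies one immediate count to every unmasked element. The function name model_vpsrld is hypothetical, and the swizzle/up-conversion of the source is assumed to have been applied already.

```c
#include <stdint.h>

/* Hypothetical scalar model of a write-masked logical right shift of
 * 16 uint32 elements by a single immediate count. */
void model_vpsrld(uint32_t dst[16], uint16_t k1,
                  const uint32_t src[16], uint8_t imm8)
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1u))
            continue;                  /* masked off: dst[n] is preserved */
        /* Shifting a 32-bit value by 32 or more is undefined in C, so
         * the "greater than 31 gives all 0s" rule is made explicit. */
        dst[n] = (imm8 > 31) ? 0u : (src[n] >> imm8);
    }
}
```

The explicit test on imm8 is the only place where the model differs from a plain C shift expression.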


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = tmpSrc2[i+31:i] >> IMM8[7:0]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_srli_epi32 (__m512i, unsigned int);
__m512i _mm512_mask_srli_epi32 (__m512i, __mmask16, __m512i, unsigned int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSRLVD - Shift Int32 Vector Right Logical 


Opcode                     Instruction                Description
MVEX.NDS.512.66.0F38.W0    vpsrlvd zmm1 {k1}, zmm2,   Shift right logical int32 vector zmm2 and int32
45 /r                      S_i32(zmm3/m_t)            vector S_i32(zmm3/m_t) and store the result in
                                                      zmm1, under write-mask.


Description 


Performs an element-by-element logical right shift of int32 vector zmm2, shifting by the 
number of bits specified by the int32 vector result of the swizzle/broadcast/conversion 
process on memory or vector int32 zmm3. The result is stored in int32 vector zmm1. 


A logical right shift shifts a 0-bit into the msb for each shift count, so the final result has
the n msbs set to 0, where n is the number of bits by which to shift right.

If the value specified by the shift operand is greater than 31 then the destination operand
is set to all 0s.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] >> tmpSrc3[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_srlv_epi32 (__m512i, __m512i);
__m512i _mm512_mask_srlv_epi32 (__m512i, __mmask16, __m512i, __m512i);

Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBD - Subtract Int32 Vectors

Opcode                   Instruction               Description
MVEX.NDS.512.66.0F.W0    vpsubd zmm1 {k1}, zmm2,   Subtract int32 vector S_i32(zmm3/m_t) from
FA /r                    S_i32(zmm3/m_t)           int32 vector zmm2 and store the result in
                                                   zmm1, under write-mask.


Description 


Performs an element-by-element subtraction from int32 vector zmm2 of the int32 vector 
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = zmm2[i+31:i] - tmpSrc3[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_sub_epi32 (__m512i, __m512i);
__m512i _mm512_mask_sub_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBRD - Reverse Subtract Int32 Vectors

Opcode                     Instruction                Description
MVEX.NDS.512.66.0F38.W0    vpsubrd zmm1 {k1}, zmm2,   Subtract int32 vector zmm2 from int32 vector
6C /r                      S_i32(zmm3/m_t)            S_i32(zmm3/m_t) and store the result in zmm1,
                                                      under write-mask.


Description 


Performs an element-by-element subtraction of int32 vector zmm2 from the int32 vector 
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        zmm1[i+31:i] = -zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_subr_epi32 (__m512i, __m512i);
__m512i _mm512_mask_subr_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBRSETBD - Reverse Subtract Int32 Vectors and Set Borrow

Opcode                     Instruction                  Description
MVEX.NDS.512.66.0F38.W0    vpsubrsetbd zmm1 {k1}, k2,   Subtract int32 vector zmm1 from int32 vector
6F /r                      S_i32(zmm3/m_t)              S_i32(zmm3/m_t) and store the subtraction in
                                                        zmm1 and the borrow from the subtraction in
                                                        k2, under write-mask.


Description 


Performs an element-by-element subtraction of int32 vector zmm1 from the int32 vector 
result of the swizzle/broadcast/conversion process on memory or int32 vector zmm3. 
The result is written into int32 vector zmm1. 


In addition, the borrow from the subtraction for the n-th element is written into the n-th 
bit of vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 
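
The interaction between the reversed operand order, the borrow output, and the write-mask can be sketched in scalar C. This is a model under assumptions, not a definitive implementation: model_vpsubrsetbd is a hypothetical name, and Borrow() is assumed to mean the unsigned borrow-out of the 32-bit subtraction.

```c
#include <stdint.h>

/* Hypothetical scalar model of reverse subtract with borrow-out.
 * zmm1 is both the subtrahend and the destination. The function
 * returns the updated value of k2. */
uint16_t model_vpsubrsetbd(uint32_t zmm1[16], uint16_t k1, uint16_t k2,
                           const uint32_t src3[16])
{
    for (int n = 0; n < 16; n++) {
        if (!((k1 >> n) & 1u))
            continue;                 /* zmm1[n] and k2[n] are unchanged */
        /* Assumption: a borrow occurs when the minuend (src3) is
         * smaller than the subtrahend (zmm1) as unsigned values. */
        unsigned borrow = src3[n] < zmm1[n];
        zmm1[n] = src3[n] - zmm1[n];  /* wraps modulo 2^32 */
        k2 = (uint16_t)((k2 & ~(1u << n)) | (borrow << n));
    }
    return k2;
}
```

Bits of k2 whose write-mask bit is clear pass through unchanged, which matches the "retain their previous value" rule stated above.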


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        k2[n] = Borrow(tmpSrc3[i+31:i] - zmm1[i+31:i])
        zmm1[i+31:i] = tmpSrc3[i+31:i] - zmm1[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_subrsetb_epi32 (__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_subrsetb_epi32 (__m512i, __mmask16, __mmask16, __m512i,
                                    __mmask16*);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPSUBSETBD - Subtract Int32 Vectors and Set Borrow

Opcode                     Instruction                 Description
MVEX.NDS.512.66.0F38.W0    vpsubsetbd zmm1 {k1}, k2,   Subtract int32 vector S_i32(zmm3/m_t) from
5F /r                      S_i32(zmm3/m_t)             int32 vector zmm1 and store the subtraction in
                                                       zmm1 and the borrow from the subtraction in
                                                       k2, under write-mask.

Description


Performs an element-by-element subtraction of the int32 vector result of the
swizzle/broadcast/conversion process on memory or int32 vector zmm3 from int32 vector
zmm1. The result is written into int32 vector zmm1.


In addition, the borrow from the subtraction for the n-th element is written into the n-th 
bit of vector mask k2. 


This instruction is write-masked, so only those elements with the corresponding bit set in 
vector mask register k1 are computed and stored into zmm1 and k2. Elements in zmm1 
and k2 with the corresponding bit clear in k1 retain their previous value. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // integer operation
        k2[n] = Borrow(zmm1[i+31:i] - tmpSrc3[i+31:i])
        zmm1[i+31:i] = zmm1[i+31:i] - tmpSrc3[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_subsetb_epi32 (__m512i, __m512i, __mmask16*);
__m512i _mm512_mask_subsetb_epi32 (__m512i, __mmask16, _
__mmask16*);

Exceptions 


Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.
#UD              If processor model does not implement the specific instruction.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPTESTMD - Logical AND Int32 Vectors and Set Vector Mask

Opcode                     Instruction               Description
MVEX.NDS.512.66.0F38.W0    vptestmd k2 {k1}, zmm1,   Perform a bitwise AND between int32 vector
27 /r                      S_i32(zmm2/m_t)           zmm1 and int32 vector S_i32(zmm2/m_t), and
                                                     set vector mask k2 to reflect the zero/non-zero
                                                     status of each element of the result, under
                                                     write-mask.


Description 


Performs an element-by-element bitwise AND between int32 vector zmm1 and the int32
vector result of the swizzle/broadcast/conversion process on memory or int32 vector
zmm2, and uses the result to construct a 16-bit vector mask, with a 0-bit for each element
for which the result of the AND was 0, and a 1-bit where the result of the AND was not
zero. The final result is written into vector mask k2.

The write-mask does not perform the normal write-masking function for this instruction.
While it does enable/disable comparisons, it does not block updating of the destination;
instead, if a write-mask bit is 0, the corresponding destination bit is set to 0. Nonetheless,
the operation is similar enough so that it makes sense to use the usual write-mask
notation. This mode of operation is desirable because the result will be used directly as a
write-mask, rather than the normal case where the result is used with a separate write-mask
that keeps the masked elements inactive.
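
The difference from ordinary write-masking can be seen in a short scalar model. The sketch below is illustrative only; model_vptestmd is a hypothetical name, and the second source is assumed to have already gone through the swizzle/up-conversion step.

```c
#include <stdint.h>

/* Hypothetical scalar model. Returns the new value of k2. Unlike
 * ordinary write-masking, bits of k2 whose k1 bit is 0 are cleared
 * rather than preserved, so the old k2 is not an input at all. */
uint16_t model_vptestmd(uint16_t k1, const uint32_t src1[16],
                        const uint32_t src2[16])
{
    uint16_t k2 = 0;

    for (int n = 0; n < 16; n++) {
        if (((k1 >> n) & 1u) && (src1[n] & src2[n]) != 0)
            k2 |= (uint16_t)(1u << n);
    }
    return k2;
}
```

Because masked-off positions always produce 0, the returned mask is a subset of k1 and can be used directly as the write-mask of a following instruction.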


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc2[511:0] = zmm2[511:0]
} else {
    tmpSrc2[511:0] = SwizzUpConvLoad_i32(zmm2/m_t)
}

for (n = 0; n < 16; n++) {
    k2[n] = 0
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        if ((zmm1[i+31:i] & tmpSrc2[i+31:i]) != 0) {
            k2[n] = 1
        }
    }
}

Flags Affected

None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__mmask16 _mm512_test_epi32_mask (__m512i, __m512i);
__mmask16 _mm512_mask_test_epi32_mask (__mmask16, __m512i, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPXORD - Bitwise XOR Int32 Vectors

Opcode                   Instruction               Description
MVEX.NDS.512.66.0F.W0    vpxord zmm1 {k1}, zmm2,   Perform a bitwise XOR between int32 vector
EF /r                    S_i32(zmm3/m_t)           zmm2 and int32 vector S_i32(zmm3/m_t) and
                                                   store the result in zmm1, under write-mask.


Description 


Performs an element-by-element bitwise XOR between int32 vector zmm2 and the int32 
vector result of the swizzle/broadcast/conversion process on memory or int32 vector 
zmm3. The result is written into int32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t)
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // signed integer operation
        zmm1[i+31:i] = zmm2[i+31:i] ^ tmpSrc3[i+31:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i32

S2S1S0 | Function:                 | Usage                   | disp8*N
000    | no conversion             | [rax] {16to16} or [rax] | 64
001    | broadcast 1 element (x16) | [rax] {1to16}           | 4
010    | broadcast 4 elements (x4) | [rax] {4to16}           | 16
011    | reserved                  | N/A                     | N/A
100    | uint8 to uint32           | [rax] {uint8}           | 16
101    | sint8 to sint32           | [rax] {sint8}           | 16
110    | uint16 to uint32          | [rax] {uint16}          | 32
111    | sint16 to sint32          | [rax] {sint16}          | 32

Register Swizzle: S_i32

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_xor_epi32 (__m512i, __m512i);
__m512i _mm512_mask_xor_epi32 (__m512i, __mmask16, __m512i, __m512i);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VPXORQ - Bitwise XOR Int64 Vectors

Opcode                   Instruction               Description
MVEX.NDS.512.66.0F.W1    vpxorq zmm1 {k1}, zmm2,   Perform a bitwise XOR between int64 vector
EF /r                    S_i64(zmm3/m_t)           zmm2 and int64 vector S_i64(zmm3/m_t) and
                                                   store the result in zmm1, under write-mask.


Description 


Performs an element-by-element bitwise XOR between int64 vector zmm2 and the int64
vector result of the swizzle/broadcast/conversion process on memory or int64 vector
zmm3. The result is written into int64 vector zmm1.


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    tmpSrc3[511:0] = SwizzUpConvLoad_i64(zmm3/m_t)
}

for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        zmm1[i+63:i] = zmm2[i+63:i] ^ tmpSrc3[i+63:i]
    }
}


Flags Affected 


None.

Memory Up-conversion: S_i64

S2S1S0 | Function:                 | Usage                 | disp8*N
000    | no conversion             | [rax] {8to8} or [rax] | 64
001    | broadcast 1 element (x8)  | [rax] {1to8}          | 8
010    | broadcast 4 elements (x2) | [rax] {4to8}          | 32
011    | reserved                  | N/A                   | N/A
100    | reserved                  | N/A                   | N/A
101    | reserved                  | N/A                   | N/A
110    | reserved                  | N/A                   | N/A
111    | reserved                  | N/A                   | N/A

Register Swizzle: S_i64

MVEX.EH=0

S2S1S0 | Function: 4 x 64 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | swap (inner) pairs    | zmm0 {cdab}
010    | swap with two-away    | zmm0 {badc}
011    | cross-product swizzle | zmm0 {dacb}
100    | broadcast a element   | zmm0 {aaaa}
101    | broadcast b element   | zmm0 {bbbb}
110    | broadcast c element   | zmm0 {cccc}
111    | broadcast d element   | zmm0 {dddd}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512i _mm512_xor_epi64 (__m512i, __m512i);
__m512i _mm512_mask_xor_epi64 (__m512i, __mmask8, __m512i, __m512i);

Exceptions

Real-Address Mode and Virtual-8086

#UD Instruction not available in these modes

Protected and Compatibility Mode

#UD Instruction not available in these modes

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 If a memory operand linear address is not aligned
                 to the data size granularity dictated by SwizzUpConv
                 mode.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

VRCP23PS - Reciprocal of Float32 Vector

Opcode                       Instruction            Description
MVEX.512.66.0F38.W0 CA /r    vrcp23ps zmm1 {k1},    Compute the approximate reciprocals of float32
                             zmm2/m_t               vector zmm2/m_t and store the result in zmm1,
                                                    under write-mask.


Description 


Computes the element-by-element reciprocal approximation of the float32 vector on
memory or float32 vector zmm2 with 0.912ULP (relative error). The result is written
into float32 vector zmm1.

If any source element is NaN, the quietized NaN source value is returned for that element.
If any source element is ±∞, 0.0 is returned for that element. Also, if any source element
is ±0.0, ±∞ is returned for that element.

Current implementation of this instruction does not support any SwizzUpConv setting
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result
in an Invalid Opcode exception.

recip_1ulp() function follows Table 6.26 when dealing with floating-point special numbers.

Input | Result     | Comment
NaN   | input qNaN | raise #I flag if sNaN
+∞    | +0         |
+0    | +∞         | raise #Z flag
-0    | -∞         | raise #Z flag
-∞    | -0         |
2^n   | 2^-n       | exact result

Table 6.26: recip_1ulp() special floating-point values behavior

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.
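
The special-value rows of Table 6.26, together with the denormal handling listed for this instruction (input denormals treated as zeros, tiny results flushed to zero), can be checked against a small scalar model. The sketch below is an assumption-laden illustration: model_rcp_special is a hypothetical name, it relies on IEEE 754 float arithmetic on the host, and it uses an exact division, so it does not reproduce the approximation error of the instruction or its exception flags.

```c
#include <math.h>

/* Hypothetical scalar model of the special-value behavior only. */
float model_rcp_special(float x)
{
    /* Denormal inputs are treated as zeros of the same sign. */
    if (fpclassify(x) == FP_SUBNORMAL)
        x = signbit(x) ? -0.0f : 0.0f;

    /* IEEE division: NaN gives NaN, +-inf gives +-0, +-0 gives +-inf. */
    float r = 1.0f / x;

    /* Tiny (denormal) results are flushed to a zero of the same sign. */
    if (fpclassify(r) == FP_SUBNORMAL)
        r = signbit(r) ? -0.0f : 0.0f;
    return r;
}
```

For an exact power of two such as 4.0f the model returns 0.25f, matching the last row of the table.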


Operation 


tmpSrc2[511:0] = zmm2/m_t
if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Supress_Exception_Flags() // SAE
}

for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        zmm1[i+31:i] = recip_1ulp(tmpSrc2[i+31:i])
    }
}


SIMD Floating-Point Exceptions 


Invalid, Zero. 


Denormal Handling 


Treat Input Denormals As Zeros : 
YES 


Flush Tiny Results To Zero : 
YES 


Register Swizzle

MVEX.EH=0

S2S1S0 | Function: 4 x 32 bits | Usage
000    | no swizzle            | zmm0 or zmm0 {dcba}
001    | reserved              | N/A
010    | reserved              | N/A
011    | reserved              | N/A
100    | reserved              | N/A
101    | reserved              | N/A
110    | reserved              | N/A
111    | reserved              | N/A

MVEX.EH=1

S2S1S0 | Rounding Mode Override        | Usage
1xx    | SAE (Suppress-All-Exceptions) | , {sae}


Intel® C/C++ Compiler Intrinsic Equivalent

__m512 _mm512_rcp23_ps (__m512);
__m512 _mm512_mask_rcp23_ps (__m512, __mmask16, __m512);

Exceptions


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 

64 bit Mode

#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3]=1.

                 If preceded by any REX, F0, F2, F3, or 66 prefixes.

                 This instruction does not support any
                 SwizzUpConv different from the default value (no broadcast,
                 no conversion). If SwizzUpConv function is set to any value
                 different than "no action", then an Invalid Opcode fault is
                 raised. This includes register swizzles.

VRNDFXPNTPD - Round Float64 Vector

Opcode                          Instruction                Description
MVEX.512.66.0F3A.W1 52 /r ib    vrndfxpntpd zmm1 {k1},     Round float64 vector S_f64(zmm2/m_t) and
                                S_f64(zmm2/m_t), imm8      store the result in zmm1, using imm8,
                                                           under write-mask.


Description 


Performs an element-by-element rounding of the result of the swizzle/broadcast/conversion
from memory or float64 vector zmm2. The rounding result for each element is a float64
containing an integer or fixed-point value, depending on the value of expadj; the direction
of rounding depends on the value of RC. The result is written into float64 vector zmm1.

This instruction doesn't actually convert the result to an int64; the results are float64s,
just like the input, but are float64s containing the integer or fixed-point values that result
from the specified rounding and scaling.

RoundToInt() function follows Table 6.27 when dealing with floating-point special numbers.

Table 6.27: RoundToInt() special floating-point values behavior

Input | Result
NaN   | quietized input NaN
+∞    | +∞
+0    | +0
-0    | -0
-∞    | -∞

This instruction is write-masked, so only those elements with the corresponding bit set
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with
the corresponding bit clear in k1 retain their previous values.

Immediate Format

Rounding Mode                                     | I1 I0
rn | Round to Nearest (even)                      | 0  0
rd | Round Down (Round toward Negative Infinity)  | 0  1
ru | Round Up (Round toward Positive Infinity)    | 1  0
rz | Round toward Zero                            | 1  1

Exponent Adjustment | value                                | I7 I6 I5 I4
0                   | 2^0 (64.0 - no exponent adjustment)  | 0  0  0  0
4                   | 2^4 (60.4)                           | 0  0  0  1
5                   | 2^5 (59.5)                           | 0  0  1  0
8                   | 2^8 (56.8)                           | 0  0  1  1
16                  | 2^16 (48.16)                         | 0  1  0  0
24                  | 2^24 (40.24)                         | 0  1  0  1
31                  | 2^31 (33.31)                         | 0  1  1  0
32                  | 2^32 (32.32)                         | 0  1  1  1
reserved            | *must UD*                            | 1  x  x  x

Operation


RoundingMode = IMM8[1:0] 
expadj = IMM8[6:4] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoad_f64(zmm2/m_t) 
} 


for (n = 0; n < 8; n++) { 
    if(k1[n] != 0) { 
        i = 64*n 
        // float64 operation 
        zmm1[i+63:i] = 
            RoundToInt(tmpSrc2[i+63:i] * EXPADJ_TABLE[expadj], RoundingMode) / 
            EXPADJ_TABLE[expadj] 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Precision. 
Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
NO 


Memory Up-conversion: S_f64 


S2S1S0   Function:                    Usage                    disp8*N 
000      no conversion                [rax] {8to8} or [rax]    64 
001      broadcast 1 element (x8)     [rax] {1to8}             8 
010      broadcast 4 elements (x2)    [rax] {4to8}             32 
011      reserved                     N/A                      N/A 
100      reserved                     N/A                      N/A 
101      reserved                     N/A                      N/A 
110      reserved                     N/A                      N/A 
111      reserved                     N/A                      N/A 


Register Swizzle: S_f64 


MVEX.EH=0 

S2S1S0   Function: 4 x 64 bits     Usage 
000      no swizzle                zmm0 or zmm0 {dcba} 
001      swap (inner) pairs        zmm0 {cdab} 
010      swap with two-away        zmm0 {badc} 
011      cross-product swizzle     zmm0 {dacb} 
100      broadcast a element       zmm0 {aaaa} 
101      broadcast b element       zmm0 {bbbb} 
110      broadcast c element       zmm0 {cccc} 
111      broadcast d element       zmm0 {dddd} 

MVEX.EH=1 

S2S1S0   Rounding Mode Override            Usage 
1xx      SAE (Suppress-All-Exceptions)     , {sae} 
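

The register swizzle patterns in the MVEX.EH=0 table can be read as a per-lane index permutation. The sketch below is a hypothetical scalar model (the names SWIZZLE_SRC and swizzle_f64 are not from this manual) and assumes that a pattern string such as {cdab} lists the source elements from element 3 down to element 0 of each group of four, with a as element 0.

```c
#include <assert.h>
#include <stddef.h>

/* Source element (a=0, b=1, c=2, d=3) feeding each destination position 0..3,
   indexed by S2S1S0. Assumes pattern strings read from element 3 down to 0. */
static const unsigned char SWIZZLE_SRC[8][4] = {
    {0, 1, 2, 3},  /* 000 {dcba} no swizzle */
    {1, 0, 3, 2},  /* 001 {cdab} swap (inner) pairs */
    {2, 3, 0, 1},  /* 010 {badc} swap with two-away */
    {1, 2, 0, 3},  /* 011 {dacb} cross-product swizzle */
    {0, 0, 0, 0},  /* 100 {aaaa} broadcast a */
    {1, 1, 1, 1},  /* 101 {bbbb} broadcast b */
    {2, 2, 2, 2},  /* 110 {cccc} broadcast c */
    {3, 3, 3, 3},  /* 111 {dddd} broadcast d */
};

/* Apply the swizzle independently to both 4 x 64-bit groups of an 8-element vector. */
static void swizzle_f64(double dst[8], const double src[8], unsigned sss)
{
    size_t lane, e;

    assert(sss < 8);
    for (lane = 0; lane < 2; lane++)
        for (e = 0; e < 4; e++)
            dst[4 * lane + e] = src[4 * lane + SWIZZLE_SRC[sss][e]];
}
```

The table-driven form keeps the eight encodings side by side with their usage strings, which makes it easy to compare against the table above. Note that for this particular instruction the swizzle is applied to the source before rounding.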


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512d _mm512_roundfxpnt_adjust_pd (__m512d, int, _MM_EXP_ADJ_ENUM); 
__m512d _mm512_mask_roundfxpnt_adjust_pd (__m512d, __mmask8, __m512d, int, 
_MM_EXP_ADJ_ENUM); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 
VRNDFXPNTPS - Round Float32 Vector 


Opcode                         Instruction                                    Description 
MVEX.512.66.0F3A.W0 52 /r ib   vrndfxpntps zmm1 {k1}, S_f32(zmm2/m_t), imm8   Round float32 vector S_f32(zmm2/m_t) and 
                                                                              store the result in zmm1, using imm8, 
                                                                              under write-mask. 


Description 


Performs an element-by-element rounding of the result of the swizzle/broadcast/conversion 
from memory or float32 vector zmm2. The rounding result for each element is a float32 
containing an integer or fixed-point value, depending on the value of expadj; the direction 
of rounding depends on the value of RC. The result is written into float32 vector zmm1. 


This instruction doesn't actually convert the result to an int32; the results are float32s, 
just like the input, but are float32s containing the integer or fixed-point values that result 
from the specified rounding and scaling. 


The RoundToInt() function follows Table 6.28 when dealing with floating-point special numbers. 


This instruction treats input denormals as zeros according to the DAZ control bit, but does 
not flush tiny results to zero. 


Table 6.28: RoundToInt() special floating-point values behavior 


Input | Result 
NaN   | quietized input NaN 
+∞    | +∞ 
+0    | +0 
-0    | -0 
-∞    | -∞ 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Immediate Format 


Rounding Mode                                      I1  I0 
rn   Round to Nearest (even)                       0   0 
rd   Round Down (Round toward Negative Infinity)   0   1 
ru   Round Up (Round toward Positive Infinity)     1   0 
rz   Round toward Zero                             1   1 


Exponent Adjustment   value                                  I7  I6  I5  I4 
0                     2^0 (32.0 - no exponent adjustment)    0   0   0   0 
4                     2^4 (28.4)                             0   0   0   1 
5                     2^5 (27.5)                             0   0   1   0 
8                     2^8 (24.8)                             0   0   1   1 
16                    2^16 (16.16)                           0   1   0   0 
24                    2^24 (8.24)                            0   1   0   1 
31                    2^31 (1.31)                            0   1   1   0 
32                    2^32 (0.32)                            0   1   1   1 
reserved              *must UD*                              1   x   x   x 


Operation 


RoundingMode = IMM8[1:0] 
expadj = IMM8[6:4] 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    tmpSrc2[511:0] = zmm2[511:0] 
} else { 
    tmpSrc2[511:0] = SwizzUpConvLoad_f32(zmm2/m_t) 
} 


for (n = 0; n < 16; n++) { 
    if(k1[n] != 0) { 
        i = 32*n 
        // float32 operation 
        zmm1[i+31:i] = 
            RoundToInt(tmpSrc2[i+31:i] * EXPADJ_TABLE[expadj], RoundingMode) / 
            EXPADJ_TABLE[expadj] 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Precision. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
NO 
Memory Up-conversion: S_f32 

S2S1S0   Function:                     Usage                      disp8*N 
000      no conversion                 [rax] {16to16} or [rax]    64 
001      broadcast 1 element (x16)     [rax] {1to16}              4 
010      broadcast 4 elements (x4)     [rax] {4to16}              16 
011      float16 to float32            [rax] {float16}            32 
100      uint8 to float32              [rax] {uint8}              16 
110      uint16 to float32             [rax] {uint16}             32 
111      sint16 to float32             [rax] {sint16}             32 
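

The two broadcast encodings in the table above replicate a short memory operand across all 16 elements. A minimal scalar sketch of that behavior, assuming a hypothetical helper load_bcast_f32 (not an Intel intrinsic) and ignoring the data conversions, alignment checks and faults:

```c
#include <assert.h>
#include <stddef.h>

/* Model of S2S1S0 = 000 (no conversion), 001 {1to16} and 010 {4to16}.
   Returns the number of bytes read from memory, which matches the
   disp8*N column of the table for these three encodings. */
static size_t load_bcast_f32(float dst[16], const float *mem, unsigned sss)
{
    size_t period = (sss == 0) ? 16 : (sss == 1) ? 1 : 4;
    size_t n;

    assert(sss <= 2);                /* conversions are not modeled here */
    for (n = 0; n < 16; n++)
        dst[n] = mem[n % period];    /* repeat the first 'period' elements */
    return period * sizeof(float);
}
```

With {4to16}, for instance, only 16 bytes are read and elements 0..3 are repeated four times, which is why the displacement scaling factor for that encoding is 16 rather than 64.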


Register Swizzle: S_f32 


MVEX.EH=0 

S2S1S0   Function: 4 x 32 bits     Usage 
000      no swizzle                zmm0 or zmm0 {dcba} 
001      swap (inner) pairs        zmm0 {cdab} 
010      swap with two-away        zmm0 {badc} 
011      cross-product swizzle     zmm0 {dacb} 
100      broadcast a element       zmm0 {aaaa} 
101      broadcast b element       zmm0 {bbbb} 
110      broadcast c element       zmm0 {cccc} 
111      broadcast d element       zmm0 {dddd} 

MVEX.EH=1 

S2S1S0   Rounding Mode Override            Usage 
1xx      SAE (Suppress-All-Exceptions)     , {sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 _mm512_roundfxpnt_adjust_ps (__m512, int, _MM_EXP_ADJ_ENUM); 
__m512 _mm512_mask_roundfxpnt_adjust_ps (__m512, __mmask16, __m512, int, 
_MM_EXP_ADJ_ENUM); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 
VRSQRT23PS - Vector Reciprocal Square Root of Float32 Vector 


Opcode                       Instruction                    Description 
MVEX.512.66.0F38.W0 CB /r    vrsqrt23ps zmm1 {k1}, zmm2/m_t  Reciprocal square root float32 vector 
                                                            zmm2/m_t and store the result in zmm1, 
                                                            under write-mask. 


Description 


Computes the element-by-element reciprocal square root of the float32 vector on memory 
or float32 vector zmm2 with a precision of 0.775ULP (relative error). The result is written 
into float32 vector zmm1. 


If any source element is NaN, the quietized NaN source value is returned for that element. 
Negative source numbers, as well as -∞, return the canonical NaN and set the Invalid 
Flag (#I). 


Current implementation of this instruction does not support any SwizzUpConv setting 
other than "no broadcast and no conversion"; any other SwizzUpConv setting will result 
in an Invalid Opcode exception. 


The rsqrt_1ulp() function follows Table 6.29 when dealing with floating-point special numbers. 


For an input value of ±0 the instruction returns ±∞ and sets the Divide-By-Zero flag 
(#Z). Negative numbers should return NaN and set the Invalid flag (#I). Note however that 
this instruction treats input denormals as zeros of the same sign, so for denormal negative 
inputs it returns -∞ and sets the Divide-By-Zero status flag. 


Input | Result     | Comments 
NaN   | input qNaN | Raise #I flag if sNaN 
+∞    | +0         | 
+0    | +∞         | Raise #Z flag 
-0    | -∞         | Raise #Z flag 
<0    | NaN        | Raise #I flag 
-∞    | NaN        | Raise #I flag 

a a exact result 


Table 6.29: rsqrt_1ulp() special floating-point values behavior 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 
Operation 


tmpSrc2[511:0] = zmm2/m_t 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
} 
for (n = 0; n < 16; n++) { 
    if (k1[n] != 0) { 
        i = 32*n 
        zmm1[i+31:i] = rsqrt_1ulp(tmpSrc2[i+31:i]) 
    } 
} 


SIMD Floating-Point Exceptions 


Invalid, Zero. 


Denormal Handling 


Treat Input Denormals As Zeros : 
YES 


Flush Tiny Results To Zero : 


YES 

Register Swizzle 

MVEX.EH=0 

S2S1S0   Function: 4 x 32 bits     Usage 
000      no swizzle                zmm0 or zmm0 {dcba} 
001      reserved                  N/A 
010      reserved                  N/A 
011      reserved                  N/A 
100      reserved                  N/A 
101      reserved                  N/A 
110      reserved                  N/A 
111      reserved                  N/A 

MVEX.EH=1 

S2S1S0   Rounding Mode Override            Usage 
1xx      SAE (Suppress-All-Exceptions)     , {sae} 
Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 __ICL_INTRINCC _mm512_rsqrt23_ps (__m512); 
__m512 __ICL_INTRINCC _mm512_mask_rsqrt23_ps (__m512, __mmask16, __m512); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  This instruction does not support any SwizzUpConv 
                  different from the default value (no broadcast, 
                  no conversion). If SwizzUpConv function is set to any value 
                  different than "no action", then an Invalid Opcode fault is 
                  raised. This includes register swizzles. 
VSCALEPS - Scale Float32 Vectors 


Opcode                           Instruction                                 Description 
MVEX.NDS.512.66.0F38.W0 84 /r    vscaleps zmm1 {k1}, zmm2, S_i32(zmm3/m_t)   Multiply float32 vector zmm2 by 2 raised to the 
                                                                             int32 vector S_i32(zmm3/m_t) and store the 
                                                                             result in zmm1, under write-mask. 


Description 


Performs an element-by-element scale of float32 vector zmm2 by multiplying it by 2^exp, 
where exp is the vector int32 result of the swizzle/broadcast/conversion process on 
memory or vector int32 zmm3. The result is written into vector float32 zmm1. 


This instruction is needed for scaling u and v coordinates according to the mipmap size, 
which is 2^mipmap_level, and for the evaluation of Exp2. 


Cases where the exponent would go out of range are handled as if multiplication (via 
vmulps) of zmm2 by 2^exp had been performed. 


If the result cannot be represented with a float32, then the properly signed ∞ (for positive 
scaling operand) or 0 (for negative scaling operand) will be returned. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) { 
    if(SSS[2]==1) Supress_Exception_Flags() // SAE 
    RoundingMode = SSS[1:0] 
    tmpSrc3[511:0] = zmm3[511:0] 
} else { 
    RoundingMode = MXCSR.RC 
    tmpSrc3[511:0] = SwizzUpConvLoad_i32(zmm3/m_t) 
} 


for (n = 0; n < 16; n++) { 
    if (k1[n] != 0) { 
        i = 32*n 
        exp[31:0] = tmpSrc3[i+31:i] 
        // signed int scale operation. float32 multiplication 
        zmm1[i+31:i] = zmm2[i+31:i] * 2^exp[31:0] 
    } 
} 
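

The write-masked scale loop above maps closely onto the standard C function ldexpf, which multiplies a float by an integral power of two. The sketch below is a hypothetical scalar model (scale_ps_model is not an Intel intrinsic); it assumes the default rounding mode of the C environment and does not model SwizzUpConv, DAZ/FZ or the exception flags.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical model of vscaleps: dst[n] = src[n] * 2^exp[n] where k1 bit n is set.
   Elements whose mask bit is clear keep their previous value. */
static void scale_ps_model(float dst[16], const float src[16],
                           const int32_t exp[16], uint16_t k1)
{
    int n;

    for (n = 0; n < 16; n++) {
        if (k1 & (1u << n))
            dst[n] = ldexpf(src[n], exp[n]);  /* overflows to infinity, underflows to 0 */
    }
}
```

As in the description, an exponent that pushes the result out of the float32 range produces a signed infinity for large positive exponents and zero for large negative ones.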




SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Denormal Handling 


Treat Input Denormals As Zeros : 
(MXCSR.DAZ)? YES : NO 


Flush Tiny Results To Zero : 
(MXCSR.FZ)? YES : NO 


Memory Up-conversion: S_i32 


S2S1S0   Function:                     Usage                      disp8*N 
000      no conversion                 [rax] {16to16} or [rax]    64 
001      broadcast 1 element (x16)     [rax] {1to16}              4 
010      broadcast 4 elements (x4)     [rax] {4to16}              16 
011      reserved                      N/A                        N/A 
100      uint8 to uint32               [rax] {uint8}              16 
101      sint8 to sint32               [rax] {sint8}              16 
110      uint16 to uint32              [rax] {uint16}             32 
111      sint16 to sint32              [rax] {sint16}             32 
Register Swizzle: S_i32 


MVEX.EH=0 

S2S1S0   Function: 4 x 32 bits     Usage 
000      no swizzle                zmm0 or zmm0 {dcba} 
001      swap (inner) pairs        zmm0 {cdab} 
010      swap with two-away        zmm0 {badc} 
011      cross-product swizzle     zmm0 {dacb} 
100      broadcast a element       zmm0 {aaaa} 
101      broadcast b element       zmm0 {bbbb} 
110      broadcast c element       zmm0 {cccc} 
111      broadcast d element       zmm0 {dddd} 

MVEX.EH=1 

S2S1S0   Rounding Mode Override               Usage 
000      Round To Nearest (even)              , {rn} 
001      Round Down (-INF)                    , {rd} 
010      Round Up (+INF)                      , {ru} 
011      Round Toward Zero                    , {rz} 
100      Round To Nearest (even) with SAE     , {rn-sae} 
101      Round Down (-INF) with SAE           , {rd-sae} 
110      Round Up (+INF) with SAE             , {ru-sae} 
111      Round Toward Zero with SAE           , {rz-sae} 


Intel® C/C++ Compiler Intrinsic Equivalent 


__m512 _mm512_scale_ps (__m512, __m512i); 
__m512 _mm512_mask_scale_ps (__m512, __mmask16, __m512, __m512i); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form. 
#GP(0)            If the memory address is in a non-canonical form. 
                  If a memory operand linear address is not aligned 
                  to the data size granularity dictated by SwizzUpConv 
                  mode. 
#PF(fault-code)   For a page fault. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 

VSCATTERDPD - Scatter Float64 Vector With Signed Dword Indices 
Opcode                             Instruction                           Description 
MVEX.512.66.0F38.W1 A2 /r /vsib    vscatterdpd mv_t {k1}, D_f64(zmm1)    Scatter float64 vector D_f64(zmm1) to vector 
                                                                         memory locations mv_t, using doubleword 
                                                                         indices and k1 as completion mask. 


Description 


Down-converts and stores all 8 elements in float64 vector zmm1 to the memory locations 
pointed by base address BASE_ADDR and doubleword index vector VINDEX, with 
scale SCALE. 


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always enforce the execution of a gather/scatter instruction to be 
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the 
gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are 
zero). 


Writes to overlapping destination memory locations are guaranteed to be ordered with 
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping 
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB 
of the source registers). Writes that are not overlapped may happen in any order. Memory 
ordering with other instructions follows the Intel-64 memory ordering model. Note that 
this does not account for non-overlapping indices that map into the same physical address 
locations. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 


Operation 


// instruction works over a subset of the write mask 
ktemp = SELECT_SUBSET(k1) 


// Use mv_t as vector memory operand (VSIB) 
for (n = 0; n < 8; n++) { 
    if (ktemp[n] != 0) { 
        i = 64*n 
        j = 32*n 
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE) 
        pointer[63:0] = mv_t[n] 
        tmp = DownConvStore_f64(zmm1[i+63:i], SSS[2:0]) 
        if (DownConvStoreSizeOf_f64(SSS[2:0]) == 8) { 
            MemStore(pointer) = tmp[63:0] 
        } 
        k1[n] = 0 
    } 
} 
k1[15:8] = 0 
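

Because each invocation may complete only a subset of the active elements, software has to re-issue the scatter until the mask is empty. The sketch below models that loop in plain C under explicit assumptions: scatter_pd_step and scatter_pd_all are hypothetical names, and the step function implements the weakest SELECT_SUBSET allowed by the description (only the least significant enabled element is stored per invocation). Real hardware may complete more elements per invocation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One modeled invocation: store the lowest active element, clear its mask bit,
   and return the updated mask. */
static uint8_t scatter_pd_step(char *base, const int32_t vindex[8],
                               const double src[8], int scale, uint8_t k1)
{
    int n;

    for (n = 0; n < 8; n++) {
        if (k1 & (1u << n)) {
            /* address = BASE_ADDR + sign-extended index * SCALE */
            memcpy(base + (int64_t)vindex[n] * scale, &src[n], sizeof(double));
            return (uint8_t)(k1 & ~(1u << n));
        }
    }
    return k1;
}

/* Re-issue the scatter until every write-mask bit is zero.
   Returns the number of invocations that were needed. */
static int scatter_pd_all(char *base, const int32_t vindex[8],
                          const double src[8], int scale, uint8_t k1)
{
    int invocations = 0;

    while (k1 != 0) {
        k1 = scatter_pd_step(base, vindex, src, scale, k1);
        invocations++;
    }
    return invocations;
}
```

Since elements are processed from the least significant mask bit upward, two elements with the same index are written in LSB-to-MSB order and the higher-numbered element's value is the one left in memory, consistent with the ordering rule for overlapping indices.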


SIMD Floating-Point Exceptions 


None. 


Memory Down-conversion: D_f64 


S2S1S0   Function:         Usage    disp8*N 
000      no conversion     zmm1     8 
001      reserved          N/A      N/A 
010      reserved          N/A      N/A 
011      reserved          N/A      N/A 
100      reserved          N/A      N/A 
101      reserved          N/A      N/A 
110      reserved          N/A      N/A 
111      reserved          N/A      N/A 


Intel® C/C++ Compiler Intrinsic Equivalent 


void _mm512_i32loextscatter_pd (void*, __m512i, __m512d, 
_MM_DOWNCONV_PD_ENUM, int, int); 

void _mm512_mask_i32loextscatter_pd (void*, __mmask8, __m512i, __m512d, 
_MM_DOWNCONV_PD_ENUM, int, int); 

void _mm512_i32loscatter_pd (void*, __m512i, __m512d, int); 

void _mm512_mask_i32loscatter_pd (void*, __mmask8, __m512i, __m512d, int); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form, and corresponding write-mask bit is not zero. 
#GP(0)            If a memory address is in a non-canonical form, 
                  and corresponding write-mask bit is not zero. 
                  If a memory operand linear address is not aligned 
                  to element-wise data granularity dictated by the DownConv 
                  mode, and corresponding write-mask bit is not zero. 
#PF(fault-code)   If a memory operand linear address produces a page fault 
                  and corresponding write-mask bit is not zero. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If using a 16 bit effective address. 
                  If ModRM.rm is different than 100b. 
                  If no write mask is provided or selected write-mask is k0. 
VSCATTERDPS - Scatter Float32 Vector With Signed Dword Indices 


Opcode                             Instruction                           Description 
MVEX.512.66.0F38.W0 A2 /r /vsib    vscatterdps mv_t {k1}, D_f32(zmm1)    Scatter float32 vector D_f32(zmm1) to vector 
                                                                         memory locations mv_t, using doubleword 
                                                                         indices and k1 as completion mask. 


Description 


Down-converts and stores all 16 elements in float32 vector zmm1 to the memory locations 
pointed by base address BASE_ADDR and doubleword index vector VINDEX, 
with scale SCALE. 


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always enforce the execution of a gather/scatter instruction to be 
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the 
gather/scatter sequence have been loaded/stored and hence, the write-mask bits all are 
zero). 


Writes to overlapping destination memory locations are guaranteed to be ordered with 
respect to each other (from LSB to MSB of the source registers). Only writes to overlapping 
vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB 
of the source registers). Writes that are not overlapped may happen in any order. Memory 
ordering with other instructions follows the Intel-64 memory ordering model. Note that 
this does not account for non-overlapping indices that map into the same physical address 
locations. 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 


Operation 


// instruction works over a subset of the write mask 
ktemp = SELECT_SUBSET(k1) 


// Use mv_t as vector memory operand (VSIB) 
for (n = 0; n < 16; n++) { 
    if (ktemp[n] != 0) { 
        i = 32*n 
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE) 
        pointer[63:0] = mv_t[n] 
        tmp = DownConvStore_f32(zmm1[i+31:i], SSS[2:0]) 
        if (DownConvStoreSizeOf_f32(SSS[2:0]) == 4) { 
            MemStore(pointer) = tmp[31:0] 
        } else if (DownConvStoreSizeOf_f32(SSS[2:0]) == 2) { 
            MemStore(pointer) = tmp[15:0] 
        } else if (DownConvStoreSizeOf_f32(SSS[2:0]) == 1) { 
            MemStore(pointer) = tmp[7:0] 
        } 
        k1[n] = 0 
    } 
} 


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal. 


Memory Down-conversion: D_f32 


S2S1S0   Function:              Usage             disp8*N 
000      no conversion          zmm1              4 
001      reserved               N/A               N/A 
010      reserved               N/A               N/A 
011      float32 to float16     zmm1 {float16}    2 
100      float32 to uint8       zmm1 {uint8}      1 
101      float32 to sint8       zmm1 {sint8}      1 
110      float32 to uint16      zmm1 {uint16}     2 
111      float32 to sint16      zmm1 {sint16}     2 
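

The disp8*N column gives the factor N by which an 8-bit displacement is multiplied, which for this instruction is the size of one element after down-conversion. A small sketch of that computation, assuming hypothetical names (DOWNCONV_F32_SIZE, scaled_disp8) and using the sizes from the table above:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per stored element for each D_f32 encoding (S2S1S0);
   0 marks a reserved encoding. */
static const unsigned char DOWNCONV_F32_SIZE[8] = {4, 0, 0, 2, 1, 1, 2, 2};

/* Effective byte displacement for a signed 8-bit displacement field. */
static int32_t scaled_disp8(int8_t disp8, unsigned sss)
{
    assert(sss < 8 && DOWNCONV_F32_SIZE[sss] != 0);
    return (int32_t)disp8 * DOWNCONV_F32_SIZE[sss];
}
```

For example, an encoded disp8 of 16 addresses byte offset 64 with no conversion, but only byte offset 32 when the elements are stored as float16.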


Intel® C/C++ Compiler Intrinsic Equivalent 


void _mm512_i32extscatter_ps (void*, __m512i, __m512, 
_MM_DOWNCONV_PS_ENUM, int, int); 
void _mm512_mask_i32extscatter_ps (void*, __mmask16, __m512i, __m512, 
_MM_DOWNCONV_PS_ENUM, int, int); 
void _mm512_i32scatter_ps (void*, __m512i, __m512, int); 
void _mm512_mask_i32scatter_ps (void*, __mmask16, __m512i, __m512, int); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#SS(0)            If a memory address referencing the SS segment is 
                  in a non-canonical form, and corresponding write-mask bit is not zero. 
#GP(0)            If a memory address is in a non-canonical form, 
                  and corresponding write-mask bit is not zero. 
                  If a memory operand linear address is not aligned 
                  to element-wise data granularity dictated by the DownConv 
                  mode, and corresponding write-mask bit is not zero. 
#PF(fault-code)   If a memory operand linear address produces a page fault 
                  and corresponding write-mask bit is not zero. 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If using a 16 bit effective address. 
                  If ModRM.rm is different than 100b. 
                  If no write mask is provided or selected write-mask is k0. 


VSCATTERPF0DPS - Scatter Prefetch Float32 Vector With Signed Dword 
Indices Into L1 


Opcode                             Instruction                        Description 
MVEX.512.66.0F38.W0 C6 /5 /vsib    vscatterpf0dps U_f32(mv_t) {k1}    Scatter Prefetch float32 vector U_f32(mv_t), 
                                                                      using doubleword indices with T0 hint, under 
                                                                      write-mask. 
Description 


Prefetches into the L1 level of cache the memory locations pointed by base address 
BASE_ADDR and doubleword index vector VINDEX, with scale SCALE, with request 
for ownership (exclusive). Up-conversion operand specifies the granularity used 
by compilers to better encode the instruction if a displacement, using disp8*N feature, is 
provided when specifying the address. If any memory access causes any type of memory 
exception, the memory access will be considered as completed (destination mask updated) 
and the exception ignored. Up-conversion parameter is optional, and it is used to 
correctly encode disp8*N. 


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always enforce the execution of a gather/scatter instruction to be 
re-executed (via a loop) until the full completion of the sequence (i.e. all elements of the 
prefetch sequence have been prefetched and hence, the write-mask bits all are zero). 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after up-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 


Note that both gather and scatter prefetches set the access bit (A) in the related TLB page 
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D). 


Operation 


// instruction works over a subset of the write mask 
ktemp = SELECT_SUBSET(k1) 


exclusive = 1 
evicthintpre = MVEX.EH 

// Use mv_t as vector memory operand (VSIB) 
for (n = 0; n < 16; n++) { 
    if (ktemp[n] != 0) { 
        i = 32*n 
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE) 
        pointer[63:0] = mv_t[n] 
        FetchL1cacheLine(pointer, exclusive, evicthintpre) 
        k1[n] = 0 
    } 
} 
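

The per-element address formation used by the VSIB operand can be sketched as below. This is an illustrative model only: vsib_addresses and cache_line_base are hypothetical helpers, the index is sign-extended to 64 bits before scaling, and the 64-byte cache line size is an assumption made for the example rather than something stated in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the 16 element addresses BASE_ADDR + SignExtend(VINDEX[n]) * SCALE. */
static void vsib_addresses(uint64_t out[16], uint64_t base,
                           const int32_t vindex[16], unsigned scale)
{
    int n;

    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    for (n = 0; n < 16; n++)
        out[n] = base + (uint64_t)((int64_t)vindex[n] * (int64_t)scale);
}

/* Start address of the cache line that would be prefetched
   (assumes 64-byte lines for illustration). */
static uint64_t cache_line_base(uint64_t addr)
{
    return addr & ~UINT64_C(63);
}
```

Negative indices address memory below the base, and several elements frequently fall into the same line, in which case one prefetch covers all of them.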


SIMD Floating-Point Exceptions 


None. 


Memory Up-conversion: U_f32 


S2S1S0   Function:              Usage              disp8*N 
000      no conversion          [rax]              4 
001      reserved               N/A                N/A 
010      reserved               N/A                N/A 
011      float16 to float32     [rax] {float16}    2 
100      uint8 to float32       [rax] {uint8}      1 
101      sint8 to float32       [rax] {sint8}      1 
110      uint16 to float32      [rax] {uint16}     2 
111      sint16 to float32      [rax] {sint16}     2 


Intel® C/C++ Compiler Intrinsic Equivalent 


void _mm512_prefetch_i32extscatter_ps (void*, __m512i, _MM_UPCONV_PS_ENUM, 
int, int); 

void _mm512_mask_prefetch_i32extscatter_ps (void*, __mmask16, __m512i, 
_MM_UPCONV_PS_ENUM, int, int); 

void _mm512_prefetch_i32scatter_ps (void*, __m512i, int, int); 

void _mm512_mask_prefetch_i32scatter_ps (void*, __mmask16, __m512i, int, int); 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#NM               If CR0.TS[bit 3]=1. 
#UD               If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If using a 16 bit effective address. 
                  If ModRM.rm is different than 100b. 
                  If no write mask is provided or selected write-mask is k0. 
VSCATTERPF0HINTDPD - Scatter Prefetch Float64 Vector Hint With Signed 
Dword Indices 


Opcode                             Instruction                            Description 
MVEX.512.66.0F38.W1 C6 /4 /vsib    vscatterpf0hintdpd U_f64(mv_t) {k1}    Scatter Prefetch float64 vector U_f64(mv_t), 
                                                                          using doubleword indices with T0 hint, under 
                                                                          write-mask. 
Description 


The instruction specifies a set of 8 float64 memory locations pointed by base address 
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance 
hint that a real scatter instruction with the same set of sources will be invoked. A 
programmer may execute this instruction before a real scatter instruction to improve its 
performance. 


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. This instruction does not 
modify any kind of architectural state (including the write-mask). 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Operation 


// Use mv_t as vector memory operand (VSIB) 
for (n = 0; n < 8; n++) { 
    if (k1[n] != 0) { 
        i = 64*n 
        j = 32*n 
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[j+31:j] * SCALE) 
        pointer[63:0] = mv_t[n] 
        HintPointer(pointer) 
    } 
} 


SIMD Floating-Point Exceptions 


None. 
Memory Up-conversion: U_f64 


S2S1S0   Function:         Usage    disp8*N 
000      no conversion     [rax]    8 
001      reserved          N/A      N/A 
010      reserved          N/A      N/A 
011      reserved          N/A      N/A 
100      reserved          N/A      N/A 
101      reserved          N/A      N/A 
110      reserved          N/A      N/A 
111      reserved          N/A      N/A 


Intel’ C/C++ Compiler Intrinsic Equivalent 


None 


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 
64 bit Mode 
#NM               If CR0.TS[bit 3]=1. 
#UD               If processor model does not implement the specific instruction. 
                  If preceded by any REX, F0, F2, F3, or 66 prefixes. 
                  If using a 16 bit effective address. 
                  If ModRM.rm is different than 100b. 
                  If no write mask is provided or selected write-mask is k0. 
VSCATTERPF0HINTDPS - Scatter Prefetch Float32 Vector Hint With Signed 
Dword Indices 


Opcode                             Instruction                            Description 
MVEX.512.66.0F38.W0 C6 /4 /vsib    vscatterpf0hintdps U_f32(mv_t) {k1}    Scatter Prefetch float32 vector U_f32(mv_t), 
                                                                          using doubleword indices with T0 hint, under 
                                                                          write-mask. 
Description 


The instruction specifies a set of 16 float32 memory locations pointed by base address 
BASE_ADDR and doubleword index vector VINDEX with scale SCALE as a performance 
hint that a real scatter instruction with the same set of sources will be invoked. A 
programmer may execute this instruction before a real scatter instruction to improve its 
performance. 


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. This instruction does not 
modify any kind of architectural state (including the write-mask). 


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element before up-conversion. 


Operation 


// Use mv_t as vector memory operand (VSIB) 
for (n = 0; n < 16; n++) { 
    if (k1[n] != 0) { 
        i = 32*n 
        // mv_t[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE) 
        pointer[63:0] = mv_t[n] 
        HintPointer(pointer) 
    } 
} 


SIMD Floating-Point Exceptions 


None. 
Memory Up-conversion: U_f32 


S2S1S0   Function:              Usage              disp8*N 
000      no conversion          [rax]              4 
001      reserved               N/A                N/A 
010      reserved               N/A                N/A 
011      float16 to float32     [rax] {float16}    2 
100      uint8 to float32       [rax] {uint8}      1 
101      sint8 to float32       [rax] {sint8}      1 
110      uint16 to float32      [rax] {uint16}     2 
111      sint16 to float32      [rax] {sint16}     2 


Intel® C/C++ Compiler Intrinsic Equivalent


None 


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#NM     If CR0.TS[bit 3]=1.
#UD     If processor model does not implement the specific instruction.
        If preceded by any REX, F0, F2, F3, or 66 prefixes.
        If using a 16 bit effective address.
        If ModRM.rm is different than 100b.
        If no write mask is provided or selected write-mask is k0.


VSCATTERPF1DPS - Scatter Prefetch Float32 Vector With Signed Dword
Indices Into L2


Opcode Instruction Description
MVEX.512.66.0F38.W0 C6 /6 /vsib   vscatterpf1dps Uf32(mvt) {k1}   Scatter Prefetch float32 vector Uf32(mvt), using doubleword indices with T1 hint, under write-mask.
Description 


Prefetches into the L2 level of cache the memory locations pointed by base address
BASE_ADDR and doubleword index vector VINDEX, with scale SCALE, with request
for ownership (exclusive). Down-conversion operand specifies the granularity used
by compilers to better encode the instruction if a displacement, using disp8*N feature, is
provided when specifying the address. If any memory access causes any type of memory
exception, the memory access will be considered as completed (destination mask updated)
and the exception ignored. Down-conversion parameter is optional, and it is used
to correctly encode disp8*N.


Note the special mask behavior as only a subset of the active elements of write mask k1 
are actually operated on (as denoted by function SELECT_SUBSET). There are only 
two guarantees about the function: (a) the destination mask is a subset of the source mask 
(identity is included), and (b) on a given invocation of the instruction, at least one element 
(the least significant enabled mask bit) will be selected from the source mask. 


Programmers should always re-execute a gather/scatter instruction (via a loop) until the
sequence has fully completed (i.e. all elements of the prefetch sequence have been
prefetched and hence all the write-mask bits are zero).


This instruction has special disp8*N and alignment rules. N is considered to be the size 
of a single vector element after down-conversion. 


Note also the special mask behavior as the corresponding bits in write mask k1 are reset 
with each destination element being updated according to the subset of write mask k1. 
This is useful to allow conditional re-trigger of the instruction until all the elements from 
a given write mask have been successfully stored. 
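
The re-trigger behaviour described above can be sketched in plain C. This is only a software model under stated assumptions, not the hardware algorithm: select_subset() is a hypothetical stand-in for SELECT_SUBSET that arbitrarily picks up to two enabled elements per invocation (the manual only guarantees that the subset includes the least significant enabled bit), and touched[] merely records which elements would have been prefetched.

```c
#include <stdint.h>

/* Hypothetical SELECT_SUBSET: lowest enabled bit plus the next enabled bit.
   Any policy that always includes the lowest enabled bit would be allowed. */
static uint16_t select_subset(uint16_t k1)
{
    unsigned v = k1;
    unsigned lowest = v & (0u - v);        /* isolate lowest set bit */
    unsigned rest = v & ~lowest;
    unsigned next = rest & (0u - rest);    /* next lowest set bit, if any */
    return (uint16_t)(lowest | next);
}

/* One simulated invocation: visit the selected elements and clear their
   bits in the write mask. Returns the updated write mask. */
static uint16_t scatter_prefetch_step(uint16_t k1, int touched[16])
{
    uint16_t ktemp = select_subset(k1);
    for (int n = 0; n < 16; n++) {
        if (ktemp & (1u << n)) {
            touched[n]++;                  /* stands in for FetchL2cacheLine */
            k1 = (uint16_t)(k1 & ~(1u << n));
        }
    }
    return k1;
}

/* Re-execute until the write mask is all zeros; returns the number of
   invocations that were needed. */
static int scatter_prefetch_all(uint16_t k1, int touched[16])
{
    int invocations = 0;
    while (k1 != 0) {
        k1 = scatter_prefetch_step(k1, touched);
        invocations++;
    }
    return invocations;
}
```

Because each invocation clears at least the least significant enabled bit, the loop always terminates in at most 16 invocations, whatever subset policy is modelled.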


Note that both gather and scatter prefetches set the access bit (A) in the related TLB page 
entry. Scatter prefetches (which prefetch data with RFO) do not set the dirty bit (D). 


Operation 


// instruction works over a subset of the write mask
ktemp = SELECT_SUBSET(k1)

exclusive = 1
evicthintpre = MVEX.EH

// Use mvt as vector memory operand (VSIB)
for (n = 0; n < 16; n++) {
    if (ktemp[n] != 0) {
        i = 32*n
        // mvt[n] = BASE_ADDR + SignExtend(VINDEX[i+31:i] * SCALE)
        pointer[63:0] = mvt[n]
        FetchL2cacheLine(pointer, exclusive, evicthintpre)
        k1[n] = 0
    }
}

SIMD Floating-Point Exceptions 


None. 


Memory Down-conversion: Df32


S2S1S0   Function:              Usage               disp8*N
000      no conversion          zmm1                4
001      reserved               N/A                 N/A
010      reserved               N/A                 N/A
011      float32 to float16     zmm1 {float16}      2
100      float32 to uint8       zmm1 {uint8}        1
101      float32 to sint8       zmm1 {sint8}        1
110      float32 to uint16      zmm1 {uint16}       2
111      float32 to sint16      zmm1 {sint16}       2


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm512_prefetch_i32extscatter_ps (void*, __m512i, _MM_UPCONV_PS_ENUM,
int, int);

void _mm512_mask_prefetch_i32extscatter_ps (void*, __mmask16, __m512i,
_MM_UPCONV_PS_ENUM, int, int);

void _mm512_prefetch_i32scatter_ps (void*, __m512i, int, int);

void _mm512_mask_prefetch_i32scatter_ps (void*, __mmask16, __m512i, int, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#NM     If CR0.TS[bit 3]=1.
        If preceded by any REX, F0, F2, F3, or 66 prefixes.
        If using a 16 bit effective address.
        If ModRM.rm is different than 100b.
        If no write mask is provided or selected write-mask is k0.


VSUBPD - Subtract Float64 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F.W1 5C /r   vsubpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)   Subtract float64 vector Sf64(zmm3/mt) from float64 vector zmm2 and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element subtraction from float64 vector zmm2 of the float64 
vector result of the swizzle/broadcast/conversion process on memory or float64 vector 
zmm3. The result is written into float64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Suppress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = zmm2[i+63:i] - tmpSrc3[i+63:i]
    }
}
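

The write-masked loop above can be modelled element by element in scalar C. The sketch below is illustrative only: masked_sub_pd is a hypothetical helper name, the third source is assumed to be already swizzled or up-converted, and the rounding-mode override and SAE handling are not modelled (the host's default rounding applies).

```c
#include <stdint.h>

#define VLEN_F64 8   /* eight float64 elements in a 512-bit vector */

/* dst[n] = a[n] - b[n] for every n whose bit is set in k1; elements whose
   mask bit is clear keep their previous value, as the manual describes. */
static void masked_sub_pd(double dst[VLEN_F64], uint8_t k1,
                          const double a[VLEN_F64], const double b[VLEN_F64])
{
    for (int n = 0; n < VLEN_F64; n++) {
        if (k1 & (1u << n)) {
            dst[n] = a[n] - b[n];
        }
    }
}
```

With k1 = 0x05, for example, only elements 0 and 2 of dst are overwritten; the remaining six elements are left untouched.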


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       (MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf64


S2S1S0   Function:                     Usage                      disp8*N
000      no conversion                 [rax] {8to8} or [rax]      64
001      broadcast 1 element (x8)      [rax] {1to8}               8
010      broadcast 4 elements (x2)     [rax] {4to8}               32
011      reserved                      N/A                        N/A
100      reserved                      N/A                        N/A
101      reserved                      N/A                        N/A
110      reserved                      N/A                        N/A
111      reserved                      N/A                        N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override                 Usage
000      Round To Nearest (even)                , {rn}
001      Round Down (-INF)                      , {rd}
010      Round Up (+INF)                        , {ru}
011      Round Toward Zero                      , {rz}
100      Round To Nearest (even) with SAE       , {rn-sae}
101      Round Down (-INF) with SAE             , {rd-sae}
110      Round Up (+INF) with SAE               , {ru-sae}
111      Round Toward Zero with SAE             , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_sub_pd (__m512d, __m512d);
__m512d _mm512_mask_sub_pd (__m512d, __mmask8, __m512d, __m512d);
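

The MVEX.EH=0 register swizzle patterns listed in the table above can be illustrated with a small scalar model. The sketch assumes that a pattern such as {cdab} is written most significant element first, with the letters a..d naming elements 0..3 of each 4-element lane, and that the same pattern is applied to both 256-bit lanes of a float64 vector. swizzle_lane and swizzle_f64 are hypothetical helper names, not compiler intrinsics, and dst must not alias src.

```c
#include <string.h>

/* Apply a 4-letter swizzle pattern to one 4-element lane.
   pattern[0] names the source of element 3, pattern[3] the source of
   element 0. Returns 0 on success, -1 on a malformed pattern. */
static int swizzle_lane(double dst[4], const double src[4], const char *pattern)
{
    if (pattern == NULL || strlen(pattern) != 4) return -1;
    for (int pos = 0; pos < 4; pos++) {
        char c = pattern[3 - pos];
        if (c < 'a' || c > 'd') return -1;
        dst[pos] = src[c - 'a'];
    }
    return 0;
}

/* Apply the same pattern to both 4 x 64-bit lanes of an 8-element vector. */
static int swizzle_f64(double dst[8], const double src[8], const char *pattern)
{
    for (int lane = 0; lane < 2; lane++) {
        if (swizzle_lane(dst + 4 * lane, src + 4 * lane, pattern) != 0) {
            return -1;
        }
    }
    return 0;
}
```

Under these assumptions, {dcba} leaves the vector unchanged, {cdab} exchanges elements 0 and 1 and elements 2 and 3 within each lane, and {aaaa} replicates element 0 of each lane.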


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.




VSUBPS - Subtract Float32 Vectors 


Opcode Instruction Description 
MVEX.NDS.512.0F.W0 5C /r   vsubps zmm1 {k1}, zmm2, Sf32(zmm3/mt)   Subtract float32 vector Sf32(zmm3/mt) from float32 vector zmm2 and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element subtraction from float32 vector zmm2 of the float32 
vector result of the swizzle/broadcast/conversion process on memory or float32 vector 
zmm3. The result is written into float32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Suppress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = zmm2[i+31:i] - tmpSrc3[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       (MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf32


S2S1S0   Function:                      Usage                        disp8*N
000      no conversion                  [rax] {16to16} or [rax]      64
001      broadcast 1 element (x16)      [rax] {1to16}                4
010      broadcast 4 elements (x4)      [rax] {4to16}                16
011      float16 to float32             [rax] {float16}              32
100      uint8 to float32               [rax] {uint8}                16
110      uint16 to float32              [rax] {uint16}               32
111      sint16 to float32              [rax] {sint16}               32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override                 Usage
000      Round To Nearest (even)                , {rn}
001      Round Down (-INF)                      , {rd}
010      Round Up (+INF)                        , {ru}
011      Round Toward Zero                      , {rz}
100      Round To Nearest (even) with SAE       , {rn-sae}
101      Round Down (-INF) with SAE             , {rd-sae}
110      Round Up (+INF) with SAE               , {ru-sae}
111      Round Toward Zero with SAE             , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_sub_ps (__m512, __m512);
__m512 _mm512_mask_sub_ps (__m512, __mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.


VSUBRPD - Reverse Subtract Float64 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W1 6D /r   vsubrpd zmm1 {k1}, zmm2, Sf64(zmm3/mt)   Subtract float64 vector zmm2 from float64 vector Sf64(zmm3/mt) and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element subtraction of float64 vector zmm2 from the float64 
vector result of the swizzle/broadcast/conversion process on memory or float64 vector 
zmm3. The result is written into float64 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Suppress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f64(zmm3/mt)
}


for (n = 0; n < 8; n++) {
    if(k1[n] != 0) {
        i = 64*n
        // float64 operation
        zmm1[i+63:i] = -zmm2[i+63:i] + tmpSrc3[i+63:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       (MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf64


S2S1S0   Function:                     Usage                      disp8*N
000      no conversion                 [rax] {8to8} or [rax]      64
001      broadcast 1 element (x8)      [rax] {1to8}               8
010      broadcast 4 elements (x2)     [rax] {4to8}               32
011      reserved                      N/A                        N/A
100      reserved                      N/A                        N/A
101      reserved                      N/A                        N/A
110      reserved                      N/A                        N/A
111      reserved                      N/A                        N/A


Register Swizzle: Sf64


MVEX.EH=0

S2S1S0   Function: 4 x 64 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override                 Usage
000      Round To Nearest (even)                , {rn}
001      Round Down (-INF)                      , {rd}
010      Round Up (+INF)                        , {ru}
011      Round Toward Zero                      , {rz}
100      Round To Nearest (even) with SAE       , {rn-sae}
101      Round Down (-INF) with SAE             , {rd-sae}
110      Round Up (+INF) with SAE               , {ru-sae}
111      Round Toward Zero with SAE             , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512d _mm512_subr_pd (__m512d, __m512d);
__m512d _mm512_mask_subr_pd (__m512d, __mmask8, __m512d, __m512d);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.


VSUBRPS - Reverse Subtract Float32 Vectors


Opcode Instruction Description
MVEX.NDS.512.66.0F38.W0 6D /r   vsubrps zmm1 {k1}, zmm2, Sf32(zmm3/mt)   Subtract float32 vector zmm2 from float32 vector Sf32(zmm3/mt) and store the result in zmm1, under write-mask.


Description 


Performs an element-by-element subtraction of float32 vector zmm2 from the float32 
vector result of the swizzle/broadcast/conversion process on memory or float32 vector 
zmm3. The result is written into float32 vector zmm1. 


This instruction is write-masked, so only those elements with the corresponding bit set 
in vector mask register k1 are computed and stored into zmm1. Elements in zmm1 with 
the corresponding bit clear in k1 retain their previous values. 


Operation 


if(source is a register operand and MVEX.EH bit is 1) {
    if(SSS[2]==1) Suppress_Exception_Flags() // SAE
    // SSS are bits 6-4 from the MVEX prefix encoding. For more details, see Table 2.14
    RoundingMode = SSS[1:0]
    tmpSrc3[511:0] = zmm3[511:0]
} else {
    RoundingMode = MXCSR.RC
    tmpSrc3[511:0] = SwizzUpConvLoad_f32(zmm3/mt)
}


for (n = 0; n < 16; n++) {
    if(k1[n] != 0) {
        i = 32*n
        // float32 operation
        zmm1[i+31:i] = -zmm2[i+31:i] + tmpSrc3[i+31:i]
    }
}


SIMD Floating-Point Exceptions 


Overflow, Underflow, Invalid, Precision, Denormal.


Denormal Handling


Treat Input Denormals As Zeros :   (MXCSR.DAZ)? YES : NO
Flush Tiny Results To Zero :       (MXCSR.FZ)? YES : NO


Memory Up-conversion: Sf32


S2S1S0   Function:                      Usage                        disp8*N
000      no conversion                  [rax] {16to16} or [rax]      64
001      broadcast 1 element (x16)      [rax] {1to16}                4
010      broadcast 4 elements (x4)      [rax] {4to16}                16
011      float16 to float32             [rax] {float16}              32
100      uint8 to float32               [rax] {uint8}                16
110      uint16 to float32              [rax] {uint16}               32
111      sint16 to float32              [rax] {sint16}               32


Register Swizzle: Sf32


MVEX.EH=0

S2S1S0   Function: 4 x 32 bits      Usage
000      no swizzle                 zmm0 or zmm0 {dcba}
001      swap (inner) pairs         zmm0 {cdab}
010      swap with two-away         zmm0 {badc}
011      cross-product swizzle      zmm0 {dacb}
100      broadcast a element        zmm0 {aaaa}
101      broadcast b element        zmm0 {bbbb}
110      broadcast c element        zmm0 {cccc}
111      broadcast d element        zmm0 {dddd}

MVEX.EH=1

S2S1S0   Rounding Mode Override                 Usage
000      Round To Nearest (even)                , {rn}
001      Round Down (-INF)                      , {rd}
010      Round Up (+INF)                        , {ru}
011      Round Toward Zero                      , {rz}
100      Round To Nearest (even) with SAE       , {rn-sae}
101      Round Down (-INF) with SAE             , {rd-sae}
110      Round Up (+INF) with SAE               , {ru-sae}
111      Round Toward Zero with SAE             , {rz-sae}


Intel® C/C++ Compiler Intrinsic Equivalent


__m512 _mm512_subr_ps (__m512, __m512);
__m512 _mm512_mask_subr_ps (__m512, __mmask16, __m512, __m512);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#SS(0)            If a memory address referencing the SS segment is
                  in a non-canonical form.
#GP(0)            If the memory address is in a non-canonical form.
                  If a memory operand linear address is not aligned
                  to the data size granularity dictated by SwizzUpConv
                  mode.
#PF(fault-code)   For a page fault.
#NM               If CR0.TS[bit 3]=1.
                  If preceded by any REX, F0, F2, F3, or 66 prefixes.




Appendix A 


Scalar Instruction Descriptions 


In this chapter, all the special scalar instructions introduced with the Knights Corner instruction set are
described.


CLEVICT0 - Evict L1 line


Opcode Instruction Description
VEX.128.F2.0F AE /7    clevict0 m8   Evict memory line from L1 in m8 using T0 hint.
MVEX.512.F2.0F AE /7   clevict0 m8   Evict memory line from L1 in m8 using T0 hint.


Description 


Invalidates from the first-level cache the cache line containing the specified linear address
(updating accordingly the cache hierarchy if the line is dirty). Note that, unlike CLFLUSH,
the invalidation is not broadcast throughout the cache coherence domain.


The MVEX form of this instruction uses disp8*64 addressing. Displacements that would
normally be 8 bits according to the ModR/M byte are still 8 bits but scaled by 64 so that
they have cache-line granularity. VEX forms of this instruction use regular disp8 addressing.
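

The disp8*64 scaling described above amounts to simple arithmetic on the signed 8-bit displacement field. The following sketch is an illustration only: effective_disp8 and encode_disp8 are hypothetical helper names, and the code assumes the disp8 field is sign-extended before scaling, as in regular x86 disp8 addressing.

```c
#include <stdint.h>

/* Byte displacement produced by a disp8 field under disp8*N addressing.
   scale_n is 64 for the MVEX form and 1 for the VEX form (plain disp8). */
static int32_t effective_disp8(int8_t disp8, int32_t scale_n)
{
    return (int32_t)disp8 * scale_n;
}

/* Try to express byte_offset as disp8*N. Returns 1 and stores the encoded
   field on success; returns 0 if the offset is not a multiple of N or the
   quotient does not fit in a signed 8-bit field. */
static int encode_disp8(int32_t byte_offset, int32_t scale_n, int8_t *disp8_out)
{
    if (scale_n <= 0 || byte_offset % scale_n != 0) return 0;
    int32_t q = byte_offset / scale_n;
    if (q < -128 || q > 127) return 0;
    *disp8_out = (int8_t)q;
    return 1;
}
```

With N = 64 the single displacement byte covers offsets from -8192 to +8128 bytes in cache-line steps; an offset such as 100 cannot be encoded this way and would need a wider displacement.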


This instruction is a hint intended for performance and may be speculative, thus may be 
dropped or specify invalid addresses without causing problems. The instruction does not 
produce any type of memory-related fault. 


Operation 


FlushL1CacheLine(linear_address) 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_clevict (const void*, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


CLEVICT1 - Evict L2 line


Opcode Instruction Description
VEX.128.F3.0F AE /7    clevict1 m8   Evict memory line from L2 in m8 using T1 hint.
MVEX.512.F3.0F AE /7   clevict1 m8   Evict memory line from L2 in m8 using T1 hint.


Description 


Invalidates from the second-level cache the cache line containing the specified linear address
(updating accordingly the cache hierarchy if the line is dirty). Note that, unlike
CLFLUSH, the invalidation is not broadcast throughout the cache coherence domain.


The MVEX form of this instruction uses disp8*64 addressing. Displacements that would
normally be 8 bits according to the ModR/M byte are still 8 bits but scaled by 64 so that
they have cache-line granularity. VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint intended for performance and may be speculative, thus may be 
dropped or specify invalid addresses without causing problems. The instruction does not 
produce any type of memory-related fault. 


Operation 


FlushL2CacheLine(linear_address) 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_clevict (const void*, int);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


DELAY - Stall Thread


Opcode Instruction Description
VEX.128.F3.0F.W0 AE /6   delay r32   Stall Thread using r32.
VEX.128.F3.0F.W1 AE /6   delay r64   Stall Thread using r64.


Description 


Hints that the processor should not fetch/issue instructions for the current thread for the
specified number of clock cycles in register source. The maximum number of clock cycles
is limited to 2^32 - 1 (32 bit counter). The instruction is speculative and could be executed
as a NOP by a given processor implementation.


Any of the following events will cause the processor to start fetching instructions for the
delayed thread again: the counter counting down to zero, an NMI or SMI, a debug exception,
a machine check exception, the BINIT# signal, the INIT# signal, or the RESET# signal.
The instruction may exit prematurely due to any interrupt (e.g. an interrupt on another
thread on the same core).


This instruction must properly handle the case where the current clock count turns over. 
This can be accomplished by performing the subtraction shown below and treating the 
result as an unsigned number. 
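

The clock turn-over handling described above relies on unsigned (modular) subtraction. The sketch below illustrates it in C under one assumption that the manual does not state: the clock count is modelled as a 64-bit unsigned value. The helper names are hypothetical.

```c
#include <stdint.h>

/* Cycles elapsed since start_clock. Because both operands are unsigned,
   the subtraction wraps modulo 2^64 and stays correct even if the clock
   count turned over between the two readings. */
static uint64_t elapsed_cycles(uint64_t current_clock, uint64_t start_clock)
{
    return current_clock - start_clock;
}

/* Clamp the requested delay to the 32-bit counter limit (2^32 - 1). */
static uint64_t clamp_delay(uint64_t src)
{
    return (src > 0xFFFFFFFFu) ? 0xFFFFFFFFu : src;
}

/* Returns 1 while the thread should keep stalling, 0 once the requested
   number of cycles has elapsed. */
static int still_delayed(uint64_t current_clock, uint64_t start_clock,
                         uint64_t src)
{
    return elapsed_cycles(current_clock, start_clock) < clamp_delay(src);
}
```

For instance, a start reading five counts below the maximum and a current reading of 10 give an elapsed value of 16, even though the raw counter value went down.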


This instruction should prevent the issuing of additional instructions on the issuing thread 
as soon as possible, to avoid the otherwise likely case where another instruction on the 
same thread that was issued 3 or 4 clocks later has to be killed, creating a pipeline bubble. 


If, on any given clock, all threads are non-runnable, then any that are non-runnable due 
to the execution of DELAY may or may not be treated as runnable threads. 


Notes about Knights Corner implementation: 


• In Knights Corner, the processor won't execute from a "delayed" thread before the
delay counter has expired, even if there are non-runnable threads at any given point
in time.


Operation 


START_CLOCK = CURRENT_CLOCK_COUNT
DELAY_SLOTS = SRC
if (DELAY_SLOTS > 0xFFFFFFFF) DELAY_SLOTS = 0xFFFFFFFF
while ( (CURRENT_CLOCK_COUNT - START_CLOCK) < DELAY_SLOTS )
{
    *avoid fetching/issuing from the current thread*
}


Flags Affected


None.


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_delay_32 (unsigned int);
void _mm_delay_64 (unsigned __int64);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is a memory location.


LZCNT - Leading Zero Count


Opcode Instruction Description
VEX.128.F3.0F.W0 BD /r   lzcnt r32, r32   Count the number of leading bits set to 0 in r32 (src), leaving the result in r32 (dst).
VEX.128.F3.0F.W1 BD /r   lzcnt r64, r64   Count the number of leading bits set to 0 in r64 (src), leaving the result in r64 (dst).


Description 


Counts the number of leading most significant zero bits in a source operand (second 
operand) returning the result into a destination (first operand). 


LZCNT is an extension of the BSR instruction. The key difference between LZCNT and BSR
is that LZCNT provides the operand size as output when the source operand is zero, while in the
case of the BSR instruction, if the source operand is zero, the contents of the destination operand are
undefined.


ZF flag is set when the most significant set bit is bit OSIZE-1. CF is set when the source 
has no set bit. 
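

The result and flag behaviour described above can be checked with a small software model of the 32-bit form. This is a sketch for illustration, not the compiler intrinsic: lzcnt_result and lzcnt32 are hypothetical names, and only DEST, CF and ZF are modelled.

```c
#include <stdint.h>

typedef struct {
    uint32_t dest;  /* number of leading zero bits, 0..32 */
    int cf;         /* 1 when the source has no set bit */
    int zf;         /* 1 when the most significant bit of the source is set */
} lzcnt_result;

/* Scan from bit 31 downwards, counting zero bits until the first set bit. */
static lzcnt_result lzcnt32(uint32_t src)
{
    lzcnt_result r = {0, 0, 0};
    int temp = 31;
    while (temp >= 0 && ((src >> temp) & 1u) == 0) {
        temp--;
        r.dest++;
    }
    r.cf = (r.dest == 32);
    r.zf = (r.dest == 0);
    return r;
}
```

A zero source yields dest = 32 with CF set, which is the case where BSR would leave its destination undefined.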


Operation 


temp = OPERAND_SIZE - 1
DEST = 0
while( (temp >= 0) AND (SRC[temp] == 0) )
{
    temp = temp - 1
    DEST = DEST + 1
}
if(DEST == OPERAND_SIZE) {
    CF = 1
} else {
    CF = 0
}
if(DEST == 0) {
    ZF = 1
} else {
    ZF = 0
}


Flags Affected


• ZF flag is set to 1 in case of zero output (most significant bit of the source is set), and
to 0 otherwise.
• CF flag is set to 1 if input was zero and cleared otherwise.
• The PF, OF, AF and SF flags are set to 0.


Intel® C/C++ Compiler Intrinsic Equivalent


unsigned int _lzcnt_u32 (unsigned int);
__int64 _lzcnt_u64 (unsigned __int64);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode 


If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location. 



POPCNT - Return the Count of Number of Bits Set to 1 


Opcode Instruction Description
VEX.128.F3.0F.W0 B8 /r   popcnt r32, r32   Count the number of bits set to 1 in r32 (src), leaving the result in r32 (dst).
VEX.128.F3.0F.W1 B8 /r   popcnt r64, r64   Count the number of bits set to 1 in r64 (src), leaving the result in r64 (dst).


Operation


tmp = 0
for (i=0; i<OPERAND_SIZE; i++)
{
    if(SRC[i] == 1) tmp = tmp + 1
}
DEST = tmp


Flags Affected


• The ZF flag is set according to the result (if SRC==0)
• The OF, SF, AF, CF and PF flags are set to 0


Intel® C/C++ Compiler Intrinsic Equivalent


unsigned int _mm_popcnt_u32 (unsigned int);
__int64 _mm_popcnt_u64 (unsigned __int64);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location.


SPFLT - Set performance monitor filtering mask


Opcode Instruction Description
VEX.128.F2.0F.W0 AE /6   spflt r32   Set performance monitoring filtering mask using r32.
VEX.128.F2.0F.W1 AE /6   spflt r64   Set performance monitoring filtering mask using r64.
Description 


SPFLT enables/disables performance monitoring on the currently executing thread only 
based on the LSB value of the source. 


SPFLT instruction is a model specific instruction and is not part of Intel® Architecture. 
The bit(s) and register(s) modified are model-specific and may vary by processor imple- 
mentation. 


The PERF_SPFLT_CTRL model-specific register modified by SPFLT instruction may also 
be read / modified with the RDMSR / WRMSR instructions, when executing at privilege 
level 0. 


The PERF_SPFLT_CTRL MSR is thread specific. SPFLT execution moves LSB of source 
(EAX) into the USR_PREF bit (bit 63) in the PERF_SPFLT_CTRL MSR. The lower N bits, 
called CNTR_x_SPFLT_EN (bits N-1:0, 1 per counter), in PERF_SPFLT_CTRL MSR control 
whether the USR_PREF bit affects enabling of performance monitoring for the corre- 
sponding counter. 


SPFLT instruction does not modify the CNTR_x_SPFLT_EN bits, whereas RDMSR and
WRMSR read / modify all bits of the PERF_SPFLT_CTRL MSR.


Enabling Performance counting


On a per thread basis, a performance monitoring counter n is incremented if, and only if:


1. PERF_GLOBAL_CTRL[n] is set to 1
2. IA32_PerfEvtSeln[22] is set to 1 (where 'n' is the enabled counter)
3. PERF_SPFLT_CTRL[n] is set to 0, or, PERF_SPFLT_CTRL[63] (USR_PREF) is set to 1.
4. The desired event is asserted for thread id T


MSR address   Per-thread?   Name
2Fh           Y             PERF_GLOBAL_CTRL
                            Bit 1: Enable IA32_PerfEvtSel1
                            Bit 0: Enable IA32_PerfEvtSel0
28h           Y             IA32_PerfEvtSel0
                            Bit 22: Enable counter 0
29h           Y             IA32_PerfEvtSel1
                            Bit 22: Enable counter 1
2Ch           Y             PERF_SPFLT_CTRL
                            Bit 63: User Preference (USR_PREF).
                            Bit 1: Counter 1 SPFLT Enable. Controls whether USR_PREF
                            is used in enabling performance monitoring for counter 1
                            Bit 0: Counter 0 SPFLT Enable. Controls whether USR_PREF
                            is used in enabling performance monitoring for counter 0
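

The four enabling conditions listed above, together with the bit positions in the MSR table, can be combined into a single predicate. The sketch below is a software model for illustration only: the function names are hypothetical, the MSR contents are passed in as plain 64-bit values rather than read with RDMSR, and n is assumed to be 0 or 1.

```c
#include <stdint.h>

#define USR_PREF_BIT      63   /* PERF_SPFLT_CTRL[63] */
#define EVTSEL_ENABLE_BIT 22   /* IA32_PerfEvtSeln[22] */

static int bit(uint64_t value, unsigned pos)
{
    return (int)((value >> pos) & 1u);
}

/* Returns 1 if counter n would be incremented for an asserted event. */
static int counter_increments(unsigned n, uint64_t perf_global_ctrl,
                              uint64_t perf_evt_sel_n,
                              uint64_t perf_spflt_ctrl, int event_asserted)
{
    int globally_enabled = bit(perf_global_ctrl, n);              /* condition 1 */
    int counter_enabled = bit(perf_evt_sel_n, EVTSEL_ENABLE_BIT); /* condition 2 */
    int filter_passes = !bit(perf_spflt_ctrl, n) ||
                        bit(perf_spflt_ctrl, USR_PREF_BIT);       /* condition 3 */
    return globally_enabled && counter_enabled && filter_passes &&
           (event_asserted != 0);                                 /* condition 4 */
}

/* Model of SPFLT: copy the LSB of the source into USR_PREF and leave the
   CNTR_x_SPFLT_EN bits untouched. Returns the new PERF_SPFLT_CTRL value. */
static uint64_t spflt(uint64_t perf_spflt_ctrl, uint64_t src)
{
    uint64_t cleared = perf_spflt_ctrl & ~(1ULL << USR_PREF_BIT);
    return cleared | ((src & 1ULL) << USR_PREF_BIT);
}
```

When the SPFLT enable bit of a counter is set, that counter only counts while USR_PREF is 1, so executing SPFLT with an odd source value switches counting on for the current thread and an even value switches it off.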


Operation 


(* i is the thread ID of the current executing thread *)
PerfFilterMask[i][0] = SRC[0];


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_spflt_32 (unsigned int);
void _mm_spflt_64 (unsigned __int64);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

#UD     If preceded by any REX, F0, F2, F3, or 66 prefixes.


TZCNT - Trailing Zero Count


Opcode Instruction Description
VEX.128.F3.0F.W0 BC /r   tzcnt r32, r32   Count the number of trailing bits set to 0 in r32 (src), leaving the result in r32 (dst).
VEX.128.F3.0F.W1 BC /r   tzcnt r64, r64   Count the number of trailing bits set to 0 in r64 (src), leaving the result in r64 (dst).


Description 


Searches the source operand (second operand) for the least significant set bit (1 bit). If a
least significant 1 bit is found, its bit index is stored in the destination operand; otherwise,
the destination operand is set to the operand size.


ZF flag is set when the least significant set bit is bit 0. CF is set when the source has no set
bit.


Operation


index = 0
if( SRC[OPERAND_SIZE-1:0] == 0 )
{
    DEST = OPERAND_SIZE
    CF = 1
}
else
{
    while(SRC[index] == 0)
    {
        index = index+1
    }
    DEST = index
    CF = 0
}


Flags Affected


• The ZF is set according to the result
• The CF is set if SRC is zero
• The PF, OF, AF and SF flags are set to 0


Intel® C/C++ Compiler Intrinsic Equivalent


unsigned int _tzcnt_u32 (unsigned int);
__int64 _tzcnt_u64 (unsigned __int64);


Exceptions


Real-Address Mode and Virtual-8086

#UD     Instruction not available in these modes


Protected and Compatibility Mode

#UD     Instruction not available in these modes


64 bit Mode

If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location.


TZCNTI - Initialized Trailing Zero Count


Opcode Instruction Description
VEX.128.F2.0F.W0 BC /r   tzcnti r32, r32   Count the number of trailing bits set to 0 between r32 (dst) and r32 (src).
VEX.128.F2.0F.W1 BC /r   tzcnti r64, r64   Count the number of trailing bits set to 0 between r64 (dst) and r64 (src).


Description 


Searches the source operand (second operand) for the least significant set bit (1 bit) 
greater than bit DEST (where DEST is the destination operand, the first operand). If a least
significant 1 bit is found, its bit index is stored in the destination operand; otherwise, the
destination operand is set to the operand size. The value of DEST is a signed offset from 
bit 0 of the source operand. Any negative DEST value will produce a search starting from 
bit 0, like TZCNT. Any DEST value equal to or greater than (OPERAND_SIZE-1) will cause 
the destination operand to be set to the operand size. 


This instruction allows continuation of searches through bit vectors without having to 
mask off each least significant 1-bit before restarting, as is required with TZCNT. 


The functionality of this instruction is exactly the same as for the TZCNT instruction,
except that the search starts at bit DEST+1 rather than bit 0.


CF is set when the specified index goes beyond the operand size or there is no set bit 
between the index and the MSB bit of the source. 


Operation 


// DEST is a signed operand, no overflow
if (DEST[OSIZE-1:0] < 0) index = 0
else index = DEST + 1


if( ( index > OPERAND_SIZE-1 ) || ( SRC[OPERAND_SIZE-1:index] == 0 ) )
{
    DEST = OPERAND_SIZE
    CF=1
}
else
{
    while(SRC[index] == 0)
    {
        index = index+1
    }
    DEST = index
    CF=0
}


Flags Affected


• The ZF is set according to the result
• The CF is set if SRC is zero between index and MSB, or index is greater than the operand
size.
• The PF, OF, AF and SF flags are set to 0


Intel® C/C++ Compiler Intrinsic Equivalent


int _mm_tzcnti_32 (int, unsigned int);
__int64 _mm_tzcnti_64 (__int64, unsigned __int64);
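

The continuation property described for TZCNTI (resuming a search from the previous result without masking off bits) can be illustrated with a portable C model of the Operation pseudocode. This is a sketch under stated assumptions: tzcnti_result, tzcnti32_model and collect_set_bits are hypothetical names introduced here, only the 32 bit form is modeled, and only DEST, CF and ZF are tracked.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record for the architectural results modeled here. */
typedef struct {
    uint32_t dest; /* value written back to the destination register */
    int cf;        /* 1 when no set bit exists at or above the start index */
    int zf;        /* 1 when the result is 0 */
} tzcnti_result;

/* Software model of the 32 bit TZCNTI Operation pseudocode. */
static tzcnti_result tzcnti32_model(int32_t dest, uint32_t src)
{
    tzcnti_result r;
    /* A negative DEST starts the search at bit 0, like TZCNT. */
    uint32_t index = (dest < 0) ? 0u : (uint32_t)dest + 1u;

    if (index > 31u || (src >> index) == 0) {
        r.dest = 32; /* OPERAND_SIZE */
        r.cf = 1;
    } else {
        while (((src >> index) & 1u) == 0) {
            index = index + 1;
        }
        r.dest = index;
        r.cf = 0;
    }
    r.zf = (r.dest == 0);
    return r;
}

/* Enumerate every set bit of src by feeding each result back in as DEST. */
static int collect_set_bits(uint32_t src, int out[32])
{
    int count = 0;
    int32_t pos = -1; /* negative start: search from bit 0 */

    for (;;) {
        tzcnti_result r = tzcnti32_model(pos, src);
        if (r.cf) {
            break; /* no further set bits */
        }
        out[count++] = (int)r.dest;
        pos = (int32_t)r.dest;
    }
    return count;
}
```

With src = 0x80000001 the loop visits bit 0, then bit 31, and stops on the third call, where the start index (32) exceeds OPERAND_SIZE-1 and CF is set.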


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If second operand is a memory location.


VPREFETCH0 - Prefetch memory line using T0 hint


Opcode Instruction Description
VEX.128.0F 18 /1 vprefetch0 m8 Prefetch memory line in m8 using T0 hint.
MVEX.512.0F 18 /1 vprefetch0 m8 Prefetch memory line in m8 using T0 hint.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.
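

The difference between the two displacement encodings can be shown with a short C sketch. It assumes the usual x86 rule that an 8 bit displacement is sign-extended before being added to the base; the function names are hypothetical and the base is treated as a plain 64 bit address with no index register or segment base.

```c
#include <assert.h>
#include <stdint.h>

/* MVEX form: the signed 8 bit displacement is scaled by 64 (one cache line). */
static uint64_t mvex_prefetch_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)((int64_t)disp8 * 64);
}

/* VEX form: the signed 8 bit displacement is used unscaled. */
static uint64_t vex_prefetch_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}
```

Under this model a one-byte displacement in the MVEX form reaches from 128 cache lines below the base to 127 lines above it (-8192 to +8128 bytes), while the VEX form reaches only -128 to +127 bytes.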


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified 
in the L1 cache). 

nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


Note that in Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes 
transparent to L2). Consequently, this instruction is not a good solution to avoid hot
L1/cold L2 performance problems. Prefetches set the access bit (A) in the related TLB 
page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 0
nthintpre = 0
FetchL1CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCH1 - Prefetch memory line using T1 hint


Opcode Instruction Description
VEX.128.0F 18 /2 vprefetch1 m8 Prefetch memory line in m8 using T1 hint.
MVEX.512.0F 18 /2 vprefetch1 m8 Prefetch memory line in m8 using T1 hint.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified 
in the L2 cache). 

nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


Note that in Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes 
transparent to L2). Consequently, this instruction is not a good solution to avoid hot
L1/cold L2 performance problems. Prefetches set the access bit (A) in the related TLB 
page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 0
nthintpre = 0
FetchL2CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCH2 - Prefetch memory line using T2 hint


Opcode Instruction Description
VEX.128.0F 18 /3 vprefetch2 m8 Prefetch memory line in m8 using T2 hint.
MVEX.512.0F 18 /3 vprefetch2 m8 Prefetch memory line in m8 using T2 hint.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified 
in the L2 cache). 

nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


Note that in Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes 
transparent to L2). Consequently, this instruction is not a good solution to avoid hot
L1/cold L2 performance problems. Prefetches set the access bit (A) in the related TLB 
page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 0
nthintpre = 1
FetchL2CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCHE0 - Prefetch memory line using T0 hint, with intent to write


Opcode Instruction Description
VEX.128.0F 18 /5 vprefetche0 m8 Prefetch memory line in m8 using T0 hint with intent to write.
MVEX.512.0F 18 /5 vprefetche0 m8 Prefetch memory line in m8 using T0 hint with intent to write.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified 
in the L1 cache). 

nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


In Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes transparent
to L2). Consequently, this instruction is not a good solution to avoid hot L1/cold L2
performance problems. Prefetches set the access bit (A) in the related TLB page entry, 
but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 1
nthintpre = 0
FetchL1CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCHE1 - Prefetch memory line using T1 hint, with intent to write


Opcode Instruction Description
VEX.128.0F 18 /6 vprefetche1 m8 Prefetch memory line in m8 using T1 hint with intent to write.
MVEX.512.0F 18 /6 vprefetche1 m8 Prefetch memory line in m8 using T1 hint with intent to write.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified 
in the L2 cache). 

nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.


The hardware drops VPREFETCH if it hits L1 (so it becomes transparent to L2). Consequently,
this instruction is not a good solution to avoid hot L1/cold L2 performance problems.
Prefetches set the access bit (A) in the related TLB page entry, but prefetches with
exclusive access (RFO) do not set the dirty bit (D).


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 1
nthintpre = 0
FetchL2CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCHE2 - Prefetch memory line using T2 hint, with intent to write


Opcode Instruction Description
VEX.128.0F 18 /7 vprefetche2 m8 Prefetch memory line in m8 using T2 hint with intent to write.
MVEX.512.0F 18 /7 vprefetche2 m8 Prefetch memory line in m8 using T2 hint with intent to write.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L2 cache (unless it's already Exclusive or Modified 
in the L2 cache). 

nthintpre (NTH): load data into the L2 nontemporal cache rather than the L2 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


Note that in Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes 
transparent to L2). Consequently, this instruction is not a good solution to avoid hot
L1/cold L2 performance problems. Prefetches set the access bit (A) in the related TLB 
page entry, but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 1
nthintpre = 1
FetchL2CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCHENTA - Prefetch memory line using NTA hint, with intent to write


Opcode Instruction Description
VEX.128.0F 18 /4 vprefetchenta m8 Prefetch memory line in m8 using NTA hint with intent to write.
MVEX.512.0F 18 /4 vprefetchenta m8 Prefetch memory line in m8 using NTA hint with intent to write.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, this instruction uses disp8*64 addressing.
Displacements that would normally be 8 bits according to the ModR/M byte are still
8 bits but scaled by 64 so that they have cache-line granularity.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified 
in the L1 cache). 

nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be
cached normally in the L2 and higher caches.


The hardware drops VPREFETCH if it hits L1 (so it becomes transparent to L2). Consequently,
this instruction is not a good solution to avoid hot L1/cold L2 performance problems.
Prefetches set the access bit (A) in the related TLB page entry, but prefetches with
exclusive access (RFO) do not set the dirty bit (D).


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 1
nthintpre = 1
FetchL1CacheLine(effective_address, exclusive, nthintpre)


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


VPREFETCHNTA - Prefetch memory line using NTA hint


Opcode Instruction Description
VEX.128.0F 18 /0 vprefetchnta m8 Prefetch memory line in m8 using NTA hint.
MVEX.512.0F 18 /0 vprefetchnta m8 Prefetch memory line in m8 using NTA hint.


Description


This is very similar to the existing IA-32 prefetch instruction, PREFETCH0, as described
in IA-32 Intel® Architecture Software Developer's Manual: Volume 2. If the line selected is 
already present in the cache hierarchy at a level closer to the processor, no data movement 
occurs. Prefetches from uncacheable or WC memory are ignored. 


In contrast with the existing prefetch instruction, the MVEX form of this instruction uses 
disp8*64 addressing. Displacements that would normally be 8 bits according to the 
ModR/M byte are still 8 bits but scaled by 64 so that they have cache-line granularity. 
VEX forms of this instruction use regular disp8 addressing.


This instruction is a hint and may be speculative, and may be dropped or specify invalid 
addresses without causing problems or memory related faults. 


This instruction contains a set of hint attributes that modify the prefetching behavior: 


exclusive: make line Exclusive in the L1 cache (unless it's already Exclusive or Modified 
in the L1 cache). 

nthintpre (NTH): load data into the L1 nontemporal cache rather than the L1 temporal 
cache. Data will be loaded in the #TIDth way and made MRU. Data should still be 
cached normally in the L2 and higher caches. 


In Knights Corner, the hardware drops VPREFETCH if it hits L1 (so it becomes transparent
to L2). Consequently, this instruction is not a good solution to avoid hot L1/cold L2
performance problems. Prefetches set the access bit (A) in the related TLB page entry, 
but prefetches with exclusive access (RFO) do not set the dirty bit (D). 


PREFETCH Hint equivalence for Knights Corner hardware


Instruction     Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0      L1            NO             NO
VPREFETCHNTA    L1            YES            NO
VPREFETCH1      L2            NO             NO
VPREFETCH2      L2            YES            NO
VPREFETCHE0     L1            NO             YES
VPREFETCHENTA   L1            YES            YES
VPREFETCHE1     L2            NO             YES
VPREFETCHE2     L2            YES            YES


Operation


exclusive = 0
nthintpre = 1
FetchL1CacheLine(effective_address, exclusive, nthintpre)
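

The eight VPREFETCH variants in this appendix share opcode 0F 18 and differ only in the /digit field (the reg field of the ModR/M byte). Combining the opcode tables with the hint equivalence table gives a small lookup, sketched below. The type and function names are hypothetical and the sketch decodes only the reg field; it does not validate prefixes or the operand form.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record combining the hint equivalence table columns. */
typedef struct {
    const char *mnemonic;
    int cache_level;  /* 1 = L1, 2 = L2 */
    int non_temporal; /* nthintpre */
    int exclusive;    /* bring line as exclusive (intent to write) */
} vprefetch_hint;

/* Indexed by the /digit value listed in each opcode table (/0 .. /7). */
static const vprefetch_hint vprefetch_hints[8] = {
    { "VPREFETCHNTA",  1, 1, 0 }, /* /0 */
    { "VPREFETCH0",    1, 0, 0 }, /* /1 */
    { "VPREFETCH1",    2, 0, 0 }, /* /2 */
    { "VPREFETCH2",    2, 1, 0 }, /* /3 */
    { "VPREFETCHENTA", 1, 1, 1 }, /* /4 */
    { "VPREFETCHE0",   1, 0, 1 }, /* /5 */
    { "VPREFETCHE1",   2, 0, 1 }, /* /6 */
    { "VPREFETCHE2",   2, 1, 1 }  /* /7 */
};

/* Select the hint attributes from bits 5:3 (reg) of a ModR/M byte. */
static const vprefetch_hint *vprefetch_decode(uint8_t modrm)
{
    return &vprefetch_hints[(modrm >> 3) & 7u];
}
```

In this layout, bit 2 of the /digit value coincides with the exclusive attribute for every entry, which matches the "with intent to write" forms occupying /4 through /7.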


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_prefetch (char const*, int);


Exceptions 


Real-Address Mode and Virtual-8086 


#UD Instruction not available in these modes 


Protected and Compatibility Mode 


#UD Instruction not available in these modes 


64 bit Mode


#UD If preceded by any REX, F0, F2, F3, or 66 prefixes.
If operand is not a memory location.


Appendix B


Knights Corner 64 bit Mode Scalar Instruction Support


In 64 bit mode, Knights Corner supports a subset of the Intel 64 Architecture instructions. The 64 bit mode 
instructions supported by Knights Corner are listed in this chapter. 


B.1 64 bit Mode General-Purpose and X87 Instructions 


Knights Corner supports most of the general-purpose register (GPR) and X87 instructions in 64 bit mode. They are
listed in Table B.2.


64 bit Mode GPR and X87 Instructions in Knights Corner: 


ADC       ADD        AND       BSF       BSR
BSWAP     BT         BTC       BTR       BTS
CALL      CBW        CDQ       CDQE      CLC
CLD       CLI        CLTS      CMC       CMP
CMPS      CMPSB      CMPSD     CMPSQ     CMPSW
CMPXCHG   CMPXCHG8B  CPUID     CQO       CWD
CWDE      DEC        DIV       ENTER     F2XM1
FABS      FADD       FADDP     FBLD      FBSTP
FCHS      FCLEX      FCOM      FCOMP     FCOMPP
FCOS      FDECSTP    FDIV      FDIVP     FDIVR
FDIVRP    FFREE      FIADD     FICOM     FICOMP
FIDIV     FIDIVR     FILD      FIMUL     FINCSTP
FINIT     FIST       FISTP     FISUB     FISUBR
FLD       FLD1       FLDCW     FLDENV    FLDL2E
FLDL2T    FLDLG2     FLDLN2    FLDPI     FLDZ
FMUL      FMULP      FNCLEX    FNINIT    FNOP
FNSAVE    FNSTCW     FNSTENV   FNSTSW    FPATAN
FPREM     FPREM1     FPTAN     FRNDINT   FRSTOR
FSAVE     FSCALE     FSIN      FSINCOS   FSQRT
FST       FSTCW      FSTENV    FSTP      FSTSW
FSUB      FSUBP      FSUBR     FSUBRP    FTST
FUCOM     FUCOMP     FUCOMPP   FWAIT     FXAM
FXCH      FXRSTOR    FXSAVE    FXTRACT   FYL2X
FYL2XP1   HLT        IDIV      IMUL      INC
INT       INT3       INTO      INVD      INVLPG
IRET      IRETD      JA        JAE       JB
JBE       JC         JCXZ      JE        JECXZ
JG        JGE        JL        JLE       JMP
JNA       JNAE       JNB       JNBE      JNC
JNE       JNG        JNGE      JNL       JNLE
JNO       JNP        JNS       JNZ       JO
JP        JPE        JPO       JS        JZ
LAHF      LAR        LEA       LEAVE     LFS
LGDT      LGS        LIDT      LLDT      LMSW
LOCK      LODS       LODSB     LODSD     LODSQ
LODSW     LOOP       LOOPE     LOOPNE    LOOPNZ
LOOPZ     LSL        LSS       LTR       MOV
MOVCR     MOVDR      MOVS      MOVSB     MOVSD
MOVSQ     MOVSW      MOVSX     MOVSXD    MOVZX
MUL       NEG        NOP       NOT       OR
POP       POPF       POPFQ     PUSH      PUSHF
PUSHFQ    RCL        RCR       RDMSR     RDPMC
RDTSC     REP        REPE      REPNE     REPNZ
RET       ROL        ROR       RSM       SAHF
SAL       SAR        SBB       SCAS      SCASB
SCASD     SCASQ      SCASW     SETA      SETAE
SETB      SETBE      SETC      SETE      SETG
SETGE     SETL       SETLE     SETNA     SETNAE
SETNB     SETNBE     SETNC     SETNE     SETNG
SETNGE    SETNL      SETNLE    SETNO     SETNP
SETNS     SETNZ      SETO      SETP      SETPE
SETPO     SETS       SETZ      SGDT      SHL
SHLD      SHR        SHRD      SIDT      SLDT
SMSW      STC        STD       STI       STOSB
STOSD     STOSQ      STOSW     STR       SUB
SWAPGS    SYSCALL    SYSRET    TEST      VERR
VERW      WAIT       WBINVD    WRMSR     XADD
XCHG      XLAT       XLATB     XOR       UD2


B.2 Knights Corner 64 bit Mode Limitations 


In 64 bit mode, Knights Corner supports a subset of the Intel 64 Architecture instructions. The following
summarizes Intel 64 Architecture instructions that are not supported in Knights Corner:


• Instructions that operate on MMX registers
• Instructions that operate on XMM registers
• Instructions that operate on YMM registers


GPR and X87 Instructions Not Supported in Knights Corner


CMOV      CMPXCHG16B  FCMOVcc   FCOMI
FCOMIP    FUCOMI      FUCOMIP   IN
INS       INSB        INSD      INSW
MONITOR   MWAIT       OUT       OUTS
OUTSB     OUTSD       OUTSW     PAUSE
SYSENTER  SYSEXIT


B.3 LDMXCSR - Load MXCSR Register


Opcode Instruction Description
0F AE /2 ldmxcsr m32 Load MXCSR register from m32


Description


Loads the source operand into the MXCSR control/status register. The source operand is
a 32 bit memory location. See MXCSR Control and Status Register in Chapter 10 of the
IA-32 Intel Architecture Software Developer's Manual, Volume 1, for a description of the
MXCSR register and its contents. See chapter 3 of this document for a description of
Knights Corner's new MXCSR feature bits.


The LDMXCSR instruction is typically used in conjunction with the STMXCSR instruction, 
which stores the contents of the MXCSR register in memory. 


The default MXCSR value at reset is 0020_0000H (DUE=1, FZ=0, RC=00, PM=0, UM=0, 
OM=0, ZM=0, DM=0, IM=0, DAZ=0, PE=0, UE=0, OE=0, ZE=0, DE=0, IE=0). 


Any attempt to set reserved bits in control register MXCSR to 1 will produce a #GP fault:


Bit            default   Comment
MXCSR[7-12]    0         Note that this corresponds to Intel® SSE's IM/DM/ZM/OM/UM/PM
MXCSR[16-20]   0         Reserved
MXCSR[22-31]   0         Reserved


Additionally, any attempt to set MXCSR.DUE (bit 21) to 0 will produce a #GP fault: 


Bit default Comment 
MXCSR[21] 1 DUE (Disable Unmasked Exceptions) always enforced in Knights Corner 
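

The two #GP conditions listed above can be expressed as a single mask test. The following C sketch is illustrative only: the macro and function names are hypothetical, and the bit ranges are taken directly from the two tables in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that must be 0 according to the reserved-bit table. */
#define KNC_MXCSR_RSVD_7_12   0x00001F80u /* bits 7-12  */
#define KNC_MXCSR_RSVD_16_20  0x001F0000u /* bits 16-20 */
#define KNC_MXCSR_RSVD_22_31  0xFFC00000u /* bits 22-31 */
#define KNC_MXCSR_RESERVED \
    (KNC_MXCSR_RSVD_7_12 | KNC_MXCSR_RSVD_16_20 | KNC_MXCSR_RSVD_22_31)

/* Bit that must be 1: DUE (Disable Unmasked Exceptions), bit 21. */
#define KNC_MXCSR_DUE         0x00200000u

/* Returns 1 if loading this value would raise #GP under the rules above. */
static int knc_mxcsr_would_fault(uint32_t value)
{
    if ((value & KNC_MXCSR_RESERVED) != 0) {
        return 1; /* a reserved bit is set to 1 */
    }
    if ((value & KNC_MXCSR_DUE) == 0) {
        return 1; /* DUE cleared */
    }
    return 0;
}
```

The reset value 0020_0000H passes this check, whereas a value such as 0020_1F80H (which sets the bit positions used for the Intel® SSE exception masks) does not.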


This instruction's operation is the same in non-64 bit modes and 64 bit mode.


Operation 


MXCSR = MemLoad(m32) 


Flags Affected 


None.


Intel® C/C++ Compiler Intrinsic Equivalent


void _mm_setcsr (unsigned int)


Exceptions 
#SS(0)           If a memory address referencing the SS segment is
                 in a non-canonical form.
#GP(0)           If the memory address is in a non-canonical form.
                 For an attempt to set reserved bits in MXCSR.
#PF(fault-code)  For a page fault.
#NM              If CR0.TS[bit 3] = 1.
#UD              If CR0.EM[bit 2] = 1.
                 If CS.L=0 or IA32_EFER.LMA=0.
                 If the lock prefix is used.
#AC(0)           If alignment checking is enabled and an unaligned
                 memory reference is made while the current privilege
                 level is 3.




B.4 FXRSTOR - Restore x87 FPU and MXCSR State 


Opcode Instruction Description 
0F AE /1 fxrstor m512byte Restore the x87 FPU and MXCSR register state from m512byte


Description 


See Intel64® Intel® Architecture Software Developer's Manual for the description of the 
original x86 instruction. 


Reloads the x87 FPU and the MXCSR state from the 512-byte memory image specified in 
the source operand. This data should have been written to memory previously using the 
FXSAVE instruction, and in the same format as required by the operating modes. The first 
byte of the data should be located on a 16-byte boundary. There are three distinct layouts
of the FXSAVE state map: one for legacy and compatibility mode, a second for 64
bit mode with promoted operand size, and a third for 64 bit mode with default
operand size.


Knights Corner follows the same layouts as described in Intel64® Intel® Architecture Soft- 
ware Developer's Manual. 


The state image referenced with an FXRSTOR instruction must have been saved using an 
FXSAVE instruction or be in the same format as required by Intel64 Intel® Architecture 
Software Developer's Manual. Referencing a state image saved with an FSAVE, FNSAVE 
instruction or incompatible field layout will result in an incorrect state restoration. 


The FXRSTOR instruction does not flush pending x87 FPU exceptions. To check and raise 
exceptions when loading x87 FPU state information with the FXRSTOR instruction, use an 
FWAIT instruction after the FXRSTOR instruction. 


Note that XMM15-0 registers are logically aliased to the low 128-bit portions of
Knights Corner registers V15 through V0 (ZMM15-0). Therefore, FXRSTOR must restore
the contents of the low 128-bit portions of registers V15 through V0.


Any attempt to set reserved bits in control register MXCSR to 1 will produce a #GP fault: 


Bit default Comment 
MXCSR[7-12] 0 Note that this corresponds to Intel® SSE's IM/DM/ZM/OM/UM/PM 
MXCSR[16-19] 0 Reserved 
MXCSR[20] 0 Reserved 
MXCSR[22-31] 0 Reserved 


Bit 


Additionally, any attempt to set MXCSR.DUE (bit 21) to 0 will produce a #GP fault: 


default Comment 


MXCSR[21] 1 DUE (Disable Unmasked Exceptions) always enforced in Knights Corner 


Operation 


(x87 FPU, MXCSR, XMM) = MemLoad(SRC) ; 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent 


void _fxrstor64 (void*); 


Exceptions 

#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 
If memory operand is not aligned on a 16-byte boundary, 
regardless of segment. 
If trying to set illegal MXCSR values. 

#MF If there is a pending x87 FPU exception. 

#PF(fault-code) For a page fault. 

#UD If CPUID.01H:EDX.FXSR[bit 24] = 0. 
If instruction is preceded by a LOCK prefix. 

#NM If CR0.TS[bit 3] = 1. 
If CR0.EM[bit 2] = 1. 

#AC If this exception is disabled a general protection exception 
(#GP) is signaled if the memory operand is not aligned on a 
16-byte boundary, as described above. If the alignment check 
exception (#AC) is enabled (and the CPL is 3), signaling of 
#AC is not guaranteed and may vary with implementation, as 
follows. In all implementations where #AC is not signaled, a 
general protection exception is signaled in its place. In 
addition, the width of the alignment check may also vary with 
implementation. For instance, for a given implementation, 
an alignment check exception might be signaled for a 2-byte 
misalignment, whereas a general protection exception might 
be signaled for all other misalignments (4-, 8-, or 16-byte 
misalignments). 


B.5 FXSAVE - Save x87 FPU and MXCSR State 


Opcode Instruction Description 
0F AE /0 fxsave m512byte Save the x87 FPU and MXCSR register state to m512byte 


Description 


See Intel64® Intel® Architecture Software Developer's Manual for the description of the 
original x86 instruction. 


Saves the current state of the x87 FPU, XMM, and MXCSR registers to a 512-byte memory 
location specified in the destination operand. The content layout of the 512-byte region 
depends on whether the processor is operating in non-64 bit operating modes or the 64 bit 
sub-mode of IA-32e mode. 


Bytes 464:511 are available for software use. The processor does not write to bytes 
464:511 of an FXSAVE area. 


Knights Corner follows a similar layout as described in Intel64® Intel® Architecture Soft- 
ware Developer's Manual. 


All bits set to 0 in the MXCSR_MASK value indicate reserved bits in the MXCSR register. 
Thus, if the MXCSR_MASK value is ANDed with a value to be written into the MXCSR register, 
the resulting value is assured of having all its reserved bits set to 0, preventing a 
general-protection exception from being generated when the value is written to the 
MXCSR register. 
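

The masking step can be sketched in C as follows. This is an illustration under stated assumptions, not part of this manual: the byte offset of MXCSR_MASK (28) and the fallback value used when the stored mask is zero (0000FFBFH) are taken from the FXSAVE layout in the Intel64® Intel® Architecture Software Developer's Manual, and the function names are hypothetical. The image is treated as plain bytes, so no FXSAVE instruction is executed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of MXCSR_MASK in the FXSAVE image (assumption taken from
 * the Software Developer's Manual layout, not from this manual). */
enum { FXSAVE_MXCSR_MASK_OFFSET = 28 };

/* Reads MXCSR_MASK from a 512-byte FXSAVE image. A stored value of 0 is
 * replaced by 0000FFBFH, following the SDM rule for a zero mask field. */
static uint32_t fxsave_mxcsr_mask(const unsigned char *image)
{
    uint32_t mask;
    memcpy(&mask, image + FXSAVE_MXCSR_MASK_OFFSET, sizeof mask);
    return mask != 0 ? mask : 0x0000FFBFu;
}

/* Clears every bit of `wanted` that the mask marks as reserved. */
static uint32_t mxcsr_apply_mask(uint32_t wanted, const unsigned char *image)
{
    return wanted & fxsave_mxcsr_mask(image);
}
```

Masking only clears reserved bits. It does not by itself set MXCSR.DUE (bit 21), which the FXRSTOR description requires to be 1 on Knights Corner.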


Note that XMM15-0 registers are logically aliased to the low 128-bit portions of 
Knights Corner registers V15 through V0 (ZMM15-0). Therefore, FXSAVE must save the 
contents of the low 128-bit portions of registers V15 through V0. 


Operation 


if (64 bit Mode) 
{ 
    if (REX.W == 1) 
    { 
        MemStore(m512byte) = Save64BitPromotedFxsave(x87 FPU, XMM15-XMM0, MXCSR); 
    } 
    else { 
        MemStore(m512byte) = Save64BitDefaultFxsave(x87 FPU, XMM15-XMM0, MXCSR); 
    } 
} 
else { 
    MemStore(m512byte) = SaveLegacyFxsave(x87 FPU, XMM7-XMM0, MXCSR); 
} 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent 


void _fxsave64 (void*); 


Exceptions 

#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 
If memory operand is not aligned on a 16-byte boundary, 
regardless of segment. 

#MF If there is a pending x87 FPU exception. 

#PF(fault-code) For a page fault. 

#UD If CPUID.01H:EDX.FXSR[bit 24] = 0. 
If instruction is preceded by a LOCK prefix. 

#NM If CR0.TS[bit 3] = 1. 
If CR0.EM[bit 2] = 1. 

#AC If this exception is disabled a general protection exception 
(#GP) is signaled if the memory operand is not aligned on a 
16-byte boundary, as described above. If the alignment check 
exception (#AC) is enabled (and the CPL is 3), signaling of 
#AC is not guaranteed and may vary with implementation, as 
follows. In all implementations where #AC is not signaled, a 
general protection exception is signaled in its place. In 
addition, the width of the alignment check may also vary with 
implementation. For instance, for a given implementation, 
an alignment check exception might be signaled for a 2-byte 
misalignment, whereas a general protection exception might 
be signaled for all other misalignments (4-, 8-, or 16-byte 
misalignments). 


B.6 RDPMC - Read Performance-Monitoring Counters 


Opcode Instruction Description 
0F 33 rdpmc Read performance-monitoring counter specified by ECX into EDX:EAX. 

Description 


Loads the 40-bit performance-monitoring counter specified in the ECX register into reg- 
isters EDX:EAX. The EDX register is loaded with the high-order 8 bits of the counter and 
the EAX register is loaded with the low-order 32 bits. The counter to be read is specified 
with an unsigned integer placed in the ECX register. 


The Knights Corner co-processor has 2 performance monitoring counters per thread, 
specified with 0000H through 0001H, respectively, in the ECX register. 
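

Software that consumes the EDX:EAX pair usually recombines it into a single value and compares two readings. The following is a minimal sketch of that arithmetic, assuming the register values have already been obtained; the function names are hypothetical and RDPMC itself is not executed.

```c
#include <assert.h>
#include <stdint.h>

#define PMC_WIDTH_BITS 40
#define PMC_MASK ((UINT64_C(1) << PMC_WIDTH_BITS) - 1)

/* Combines the EDX:EAX pair returned by RDPMC into one 40-bit value.
 * Only the low 8 bits of EDX carry counter data. */
static uint64_t pmc_value(uint32_t eax, uint32_t edx)
{
    return (((uint64_t)(edx & 0xFFu) << 32) | eax) & PMC_MASK;
}

/* Events counted between two reads, tolerating one 40-bit wrap-around. */
static uint64_t pmc_delta(uint64_t earlier, uint64_t later)
{
    return (later - earlier) & PMC_MASK;
}
```

Masking the difference to 40 bits keeps the result correct when the counter wraps once between the two reads; more than one wrap cannot be detected this way.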


When in protected or virtual 8086 mode, the performance-monitoring counters enabled 
(PCE) flag in register CR4 restricts the use of the RDPMC instruction as follows. When the 
PCE flag is set, the RDPMC instruction can be executed at any privilege level; when the flag 
is clear, the instruction can only be executed at privilege level 0. (When in real-address 
mode, the RDPMC instruction is always enabled.) 


The performance-monitoring counters can also be read with the RDMSR instruction, 
when executing at privilege level 0. 


The performance-monitoring counters are event counters that can be programmed to 
count events such as the number of instructions decoded, number of interrupts received, 
or number of cache loads. Appendix A, Performance-Monitoring Events, in the IA-32 
Intel® Architecture Software Developer's Manual, Volume 3, lists the events that can be 
counted for the Intel® Pentium® 4, Intel® Xeon®, and earlier IA-32 processors. 


The RDPMC instruction is not a serializing instruction; that is, it does not imply that all the 
events caused by the preceding instructions have been completed or that events caused 
by subsequent instructions have not begun. If an exact event count is desired, software 
must insert a serializing instruction (such as the CPUID instruction) before and/or after 
the RDPMC instruction. 


The RDPMC instruction can execute in 16 bit addressing mode or virtual-8086 mode; 
however, the full contents of the ECX register are used to select the counter, and the event 
count is stored in the full EAX and EDX registers. 


The RDPMC instruction was introduced into the IA-32 Architecture in the Intel® Pentium® 
Pro processor and the Intel® Pentium® processor with Intel® MMX™ technology. The earlier 
Intel® Pentium® processors have performance-monitoring counters, but they must be 
read with the RDMSR instruction. 


In 64 bit mode, RDPMC behavior is unchanged from 32 bit mode. The upper 32 bits of 
RAX and RDX are cleared. 


Operation 


if ( (ECX[31:0] >= 0) && (ECX[31:0] < 2) 
     && ((CR4.PCE == 1) || (CPL == 0) || (CR0.PE == 0)) ) 
{ 
    if (64 bit Mode) 
    { 
        RAX[31:0] = PMC(ECX[31:0])[31:0]; (* 40-bit read *) 
        RAX[63:32] = 0; 
        RDX[31:0] = PMC(ECX[31:0])[39:32]; 
        RDX[63:32] = 0; 
    } 
    else 
    { 
        EAX = PMC(ECX[31:0])[31:0]; (* 40-bit read *) 
        EDX = PMC(ECX[31:0])[39:32]; 
    } 
} 
else 
{ 
    #GP(0) 
} 


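

The guard condition of the Operation above can be modeled as a small predicate. This is a sketch only: the function and parameter names are hypothetical, and the inputs stand in for ECX, CR4.PCE, CPL and CR0.PE rather than being read from the processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KNC_PMC_COUNT 2u  /* counters 0 and 1, per the description above */

/* Returns true when RDPMC would read a counter, false when the
 * Operation above would raise #GP(0). */
static bool rdpmc_is_permitted(uint32_t ecx, bool cr4_pce,
                               unsigned cpl, bool cr0_pe)
{
    bool counter_exists = ecx < KNC_PMC_COUNT;
    bool access_allowed = cr4_pce || cpl == 0 || !cr0_pe;
    return counter_exists && access_allowed;
}
```

With CR4.PCE clear, a CPL 3 caller in protected mode is rejected, while the same call succeeds at CPL 0 or in real-address mode (CR0.PE = 0).
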
Flags Affected 
None. 


Intel® C/C++ Compiler Intrinsic Equivalent 


__int64 _rdpmc (int); 


Exceptions 


TBD 


B.7 STMXCSR - Store MXCSR Register State 


Opcode Instruction Description 
0F AE /3 stmxcsr m32 Store contents of MXCSR register to m32 


Description 


Stores the contents of the MXCSR control and status register to the destination operand. 


The destination operand is a 32 bit memory location. 


This instruction's operation is the same in non-64 bit modes and 64 bit mode. 


Operation 


MemStore(m32) = MXCSR 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent 


unsigned int _mm_getcsr (void); 


Exceptions 

#SS(0) If a memory address referencing the SS segment is 
in a non-canonical form. 

#GP(0) If the memory address is in a non-canonical form. 

#PF(fault-code) For a page fault. 

#NM If CR0.TS[bit 3] = 1. 

#UD If CR0.EM[bit 2] = 1. 
If CS.L=0 or IA32_EFER.LMA=0. 
If the lock prefix is used. 

#AC(0) If alignment checking is enabled and an unaligned 
memory reference is made while the current privilege 
level is 3. 


B.8 CPUID - CPUID Identification 


Opcode Instruction Description 
0F A2 cpuid Returns processor identification and feature information to the EAX, EBX, ECX, and 
EDX registers, as determined by the input value entered in EAX. 


Description 


The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If 
a software procedure can set and clear this flag, the processor executing the procedure 
supports the CPUID instruction. This instruction operates the same in non-64 bit modes 
and 64 bit mode. 


CPUID returns processor identification and feature information in the EAX, EBX, ECX, and 
EDX registers. The instruction's output depends on the contents of the EAX register 
upon execution. For example, the following pseudo-code loads EAX with 00H and causes 
CPUID to return a Maximum Return Value and the Vendor Identification String in the 
appropriate registers: 


MOV EAX, 00H 
CPUID 


Tables B.4 through B.7 show the information returned, depending on the initial value loaded 
into the EAX register. Table B.3 shows the maximum CPUID input value recognized for 
each family of IA-32 processors on which CPUID is implemented. Since the Intel® Pentium® 
4 family of processors, two types of information are returned: basic and extended function 
information. Prior to that, only the basic function information was returned. The first is 
accessed with EAX=0000000xh while the second is accessed with EAX=8000000xh. If a 
value is entered for CPUID.EAX that is invalid for a particular processor, the data for the 
highest basic information leaf is returned. 


CPUID can be executed at any privilege level to serialize instruction execution. Serializing 
instruction execution guarantees that any modifications to flags, registers, and memory 
for previous instructions are completed before the next instruction is fetched and exe- 
cuted. 


INPUT EAX = 0: Returns CPUID's Highest Value for Basic Processor Information and 
the Vendor Identification String 


When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID 
recognizes for returning basic processor information. The value is returned in the EAX 
register (see Table B.4) and is processor specific. A vendor identification string is also 
returned in EBX, EDX, and ECX. For Intel® processors, the string is "GenuineIntel" and is 
expressed: 


EBX = 756e6547h (* "Genu", with G in the low nibble of BL *) 
EDX = 49656e69h (* "ineI", with i in the low nibble of DL *) 


IA-32 Processors                                   Highest Value in EAX 
                                                   Basic Information       Extended Function Information 
Earlier Intel486 Processors                        CPUID Not Implemented   CPUID Not Implemented 
Later Intel486 Processors and Intel® Pentium®      01H                     Not Implemented 
  Processors 
Intel® Pentium® Pro and Intel® Pentium® II         02H                     Not Implemented 
  Processors, Intel® Celeron Processors 
Intel® Pentium® III Processors                     03H                     Not Implemented 
Intel® Pentium® 4 Processors                       02H                     80000004H 
Intel® Xeon® Processors                            02H                     80000004H 
Intel® Pentium® M Processor                        02H                     80000004H 
Intel® Pentium® 4 Processor supporting Intel®      05H                     80000008H 
  Hyper-Threading Technology 
Intel® Pentium® D Processor (8xx)                  05H                     80000008H 
Intel® Pentium® D Processor (9xx)                  06H                     80000008H 
Intel® Core™ Duo Processor                         0AH                     80000008H 
Intel® Core™ 2 Duo Processor                       0AH                     80000008H 
Intel® Xeon® Processor 3000, 3200, 5100, 5300      0AH                     80000008H 
  Series 
Knights Corner                                     04H                     80000008H 


Table B.3: Highest CPUID Source Operand for IA-32 Processors 


ECX = 6c65746eh (* "ntel", with n in the low nibble of CL *) 
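

The three register values for leaf 0 can be turned back into the 12-character vendor string by reading each register low byte first, in the order EBX, EDX, ECX. The following is a minimal sketch of that byte extraction; the function name is hypothetical and the register values are passed in rather than obtained by executing CPUID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Writes the 12-character vendor string plus a terminating NUL into out.
 * Each register contributes four ASCII characters, lowest byte first,
 * in the order EBX, EDX, ECX. */
static void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
        }
    }
    out[12] = '\0';
}
```

Passing 756e6547h, 49656e69h and 6c65746eh yields "GenuineIntel".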


INPUT EAX = 1: Returns Model, Family, Stepping Information 


When CPUID executes with EAX set to 1, version information is returned in EAX. Extended 
family, extended model, model, family, and processor type for the processor code-named 
Knights Corner are as follows: 


• Extended Model: 0000B 
• Extended Family: 0000_0000B 
• Model: *see table* 
• Family: 1011B 
• Processor Type: 00B 
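

The version fields can be separated from EAX with shifts and masks. The following is a minimal sketch, assuming the bit positions given for leaf 1H in Table B.4 (stepping 3-0, model 7-4, family 11-8, type 13-12, extended model 19-16, extended family 27-20); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the fields of CPUID.1H:EAX. */
typedef struct {
    unsigned stepping;
    unsigned model;
    unsigned family;
    unsigned type;
    unsigned ext_model;
    unsigned ext_family;
} cpuid_version;

/* Splits CPUID.1H:EAX into its version fields. */
static cpuid_version cpuid_decode_version(uint32_t eax)
{
    cpuid_version v;
    v.stepping   = eax & 0xFu;           /* bits 3-0   */
    v.model      = (eax >> 4) & 0xFu;    /* bits 7-4   */
    v.family     = (eax >> 8) & 0xFu;    /* bits 11-8  */
    v.type       = (eax >> 12) & 0x3u;   /* bits 13-12 */
    v.ext_model  = (eax >> 16) & 0xFu;   /* bits 19-16 */
    v.ext_family = (eax >> 20) & 0xFFu;  /* bits 27-20 */
    return v;
}
```

A value such as 00000B13H (an illustrative input, not a documented one) decodes to family 1011B, model 0001B and stepping 3.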


INPUT EAX = 1: Returns Additional Information in EBX 


When CPUID executes with EAX set to 1, additional information is returned to the EBX 
register: 


• Brand index (low byte of EBX) -- this number provides an entry into a brand string 
table that contains brand strings for IA-32 processors. More information about this 
field is provided later in this section. 


EAX   Information Provided about the Processor                            Return value 

      Basic CPUID Information 
0H    EAX  Maximum Input Value for Basic CPUID Information                1 
      EBX  "Genu"                                                         "Genu" 
      ECX  "ntel"                                                         "ntel" 
      EDX  "ineI"                                                         "ineI" 

      Basic and Extended Feature Information 
1H    EAX  Version Information: Type, Family, Model, and Stepping ID 
           Bits 3-0: Stepping Id                                          XXXX 
           Bits 7-4: Model                                                0001B 
           Bits 11-8: Family ID                                           1011B 
           Bits 13-12: Type                                               00B 
           Bits 19-16: Extended Model Id                                  00B 
           Bits 27-20: Extended Family Id                                 00000000B 
      EBX  Bits 7-0: Brand Index                                          0 
           Bits 15-8: CLFLUSH/CLEVICTn line size (Value x 8 = cache       8 
           line size in bytes) 
           Bits 23-16: Maximum number of logical processors in this       248 
           physical package. 
           Bits 31-24: Initial APIC ID                                    XXX 
      ECX  Extended Feature Information (see Table B.10)                  00000000H 
      EDX  Feature Information (see Tables B.8 and B.9)                   110193FFH 

      Cache and TLB Information 
2H    EAX  Reserved                                                       0 
      EBX  Reserved                                                       0 
      ECX  Reserved                                                       0 
      EDX  Reserved                                                       0 

      Serial Number Information 
3H    EAX  Reserved                                                       0 
      EBX  Reserved                                                       0 
      ECX  Reserved                                                       0 
      EDX  Reserved                                                       0 


Table B.4: Information Returned by CPUID Instruction 


• CLFLUSH/CLEVICTn instruction cache line size (second byte of EBX) -- this number 
indicates the size of the cache line flushed with the CLEVICT1 instruction in 8-byte 
increments. This field was introduced in the Intel® Pentium® 4 processor. 


• Local APIC ID (high byte of EBX) -- this number is the 8-bit ID that is assigned to the 
local APIC on the processor during power up. This field was introduced in the Intel® 
Pentium® 4 processor. 
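

The EBX fields described above can be unpacked one byte at a time. The following is a minimal sketch with hypothetical struct and function names; the line size is converted to bytes by multiplying the second byte by 8, as the field description states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the fields of CPUID.1H:EBX. */
typedef struct {
    unsigned brand_index;             /* EBX[7:0]            */
    unsigned cache_line_bytes;        /* EBX[15:8] times 8   */
    unsigned max_logical_processors;  /* EBX[23:16]          */
    unsigned initial_apic_id;         /* EBX[31:24]          */
} cpuid_leaf1_ebx;

/* Splits CPUID.1H:EBX into its four byte-wide fields. */
static cpuid_leaf1_ebx cpuid_decode_leaf1_ebx(uint32_t ebx)
{
    cpuid_leaf1_ebx f;
    f.brand_index            = ebx & 0xFFu;
    f.cache_line_bytes       = ((ebx >> 8) & 0xFFu) * 8u;
    f.max_logical_processors = (ebx >> 16) & 0xFFu;
    f.initial_apic_id        = (ebx >> 24) & 0xFFu;
    return f;
}
```

With the Table B.4 values (brand index 0, line size field 8, 248 logical processors) and an illustrative APIC ID of 5, EBX is 05F80800H and the decoded line size is 64 bytes.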


INPUT EAX = 1: Returns Feature Information in ECX and EDX 


When CPUID executes with EAX set to 1, feature information is returned in ECX and EDX. 


EAX   Information Provided about the Processor                            Return value 

      CPUID leaves > 3 < 80000000 are visible only when 
      IA32_MISC_ENABLES.BOOT_NT4 [bit 22] = 0 (default). 

      Deterministic Cache Parameters Leaf                                 ECX=0/1/2 
04H   Note: 04H output also depends on the initial value in ECX. 
      EAX  Bits 4-0: Cache Type (0 = Null - No more caches; 1 = Data      2/1/1 
           Cache; 2 = Instruction Cache; 3 = Unified Cache) 
           Bits 7-5: Cache Level (starts at 1)                            1/1/2 
           Bit 8: Self Initializing cache level (does not need SW         1/1/1 
           initialization) 
           Bit 9: Fully Associative cache                                 0/0/0 
           Bit 10: Write-Back Invalidate                                  0/1/1 
           Bit 11: Inclusive (of lower cache levels)                      0/1/1 
           Bits 13-12: Reserved                                           0 
           Bits 25-14: Maximum number of threads sharing this cache       3/3/3 
           in a physical package (minus one) 
           Bits 31-26: Maximum number of processor cores in this          61/61/61 
           physical package (minus one) 
      EBX  Bits 11-00: L = System Coherency Line Size (minus 1)           63/63/63 
           Bits 21-12: P = Physical Line partitions (minus 1)             0/0/0 
           Bits 31-22: W = Ways of associativity (minus 1)                7/7/7 
      ECX  S = Number of Sets (minus 1)                                   63/63/1023 
      EDX  Reserved = 0                                                   0 


Table B.5: Information Returned by CPUID Instruction (Contd.) 


• Table B.8 through Table B.9 show encodings for EDX. 
• Table B.10 shows encodings for ECX. 


For all feature flags, a 1 indicates that the feature is supported. Use Intel® to properly 
interpret feature flags. 
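

Testing one feature flag is a single shift and mask. The following is a minimal sketch with hypothetical names; the bit number for FXSR (24) is the one cited in the FXSAVE and FXRSTOR exception lists, and 110193FFH is the EDX value listed in Table B.4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPUID_EDX_FXSR_BIT 24u  /* CPUID.01H:EDX.FXSR */

/* Returns true when bit `bit` of a CPUID feature register is 1.
 * Bit numbers outside 0..31 are reported as not set. */
static bool cpuid_feature_set(uint32_t reg, unsigned bit)
{
    return bit < 32u && ((reg >> bit) & 1u) != 0;
}
```

For the EDX value 110193FFH, cpuid_feature_set(edx, CPUID_EDX_FXSR_BIT) is true, so FXSAVE and FXRSTOR would not raise #UD on that account.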


INPUT EAX = 2: Cache and TLB Information Returned in EAX, EBX, ECX, EDX 


Knights Corner considers leaf 2 to be reserved, so no cache and TLB information is re- 
turned when CPUID executes with EAX set to 2. 


INPUT EAX = 3: Serial Number Information 


Knights Corner does not implement Processor Serial Number support, as signalled by fea- 
ture bit CPUID.EAX[01h].EDX.PSN. Therefore, all the returned fields are considered re- 
served. 


INPUT EAX = 4: Returns Deterministic Cache Parameters for Each Level 


When CPUID executes with EAX set to 4 and ECX contains an index value, the processor 
returns encoded data that describe a set of deterministic cache parameters (for the cache 
level associated with the input in ECX). 
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EAX Information Provided about the Processor Return value 
Extended Function CPUID Information 
80000000H| EAX Maximum Input Value for Extended CPUID Information 80000008H 
EBX Reserved 0 
ECX — Reserved 0 
EDX _ Reserved 0 
Feature Information 
80000001H | EAX _ Reserved 0 
EBX _ Reserved 0 
ECX Bit 0: LAHF/SAHF available in 64 bit mode 1 
Bits 31-1: Reserved 0 
EDX Bits 10-0: Reserved 0 
Bit 11: SYSCALL/SYSRET available (in 64 bit mode) 1 
Bits 19-12: Reserved 0 
Bit 20: Execute Disable Bit available 0 
Bits 28-21: Reserved 0 
Bit 29: Intel® 64 Technology available 1 
Bits 31-30: Reserved 0 
Processor Brand String 
80000002H | EAX Processor Brand String 0 
EBX Processor Brand String Continued 0 
ECX Processor Brand String Continued 0 
EDX Processor Brand String Continued 0 
80000003H | EAX Processor Brand String Continued 0 
EBX Processor Brand String Continued 0 
ECX Processor Brand String Continued 0 
EDX Processor Brand String Continued 0 
80000004H | EAX Processor Brand String Continued 0 
EBX Processor Brand String Continued 0 
ECX Processor Brand String Continued 0 
EDX Processor Brand String Continued 0 
Reserved 
80000005H | EAX _ Reserved 0 
EBX Reserved 0 
ECX — Reserved 0 
EDX _ Reserved 0 


Table B.6: Information Returned by CPUID Instruction. 8000000xH leafs. 


Software can enumerate the deterministic cache parameters for each level of the cache hi- 
erarchy starting with an index value of 0, until the parameters report the value associated 
with the cache type field is 0. The architecturally defined fields reported by deterministic 
cache parameters are documented in Table B.5. The associated cache structures described 
by the different ECX descriptors are: 


• ECX=0: Instruction Cache (I1) 
• ECX=1: L1 Data Cache (L1) 
• ECX=2: L2 Data Cache (L2) 
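

The fields of leaf 4 can be combined into a cache size. The following is a minimal sketch, assuming the usual relation from the Intel64® Intel® Architecture Software Developer's Manual (size = ways x partitions x line size x sets, each stored minus one); that relation is not stated in this manual, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cache size in bytes from CPUID leaf 4 EBX and ECX.
 * Every field is stored minus one, so each is incremented before use. */
static uint64_t cpuid4_cache_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t line_size  = (ebx & 0xFFFu) + 1u;          /* EBX[11:0]  */
    uint64_t partitions = ((ebx >> 12) & 0x3FFu) + 1u;  /* EBX[21:12] */
    uint64_t ways       = ((ebx >> 22) & 0x3FFu) + 1u;  /* EBX[31:22] */
    uint64_t sets       = (uint64_t)ecx + 1u;           /* ECX        */
    return ways * partitions * line_size * sets;
}

/* Cache type field, EAX[4:0]; 0 means there are no more caches. */
static unsigned cpuid4_cache_type(uint32_t eax)
{
    return eax & 0x1Fu;
}
```

With the Table B.5 values (W = 7, P = 0, L = 63, giving EBX = 01C0003FH), S = 63 evaluates to 32768 bytes and S = 1023 evaluates to 524288 bytes. An enumeration loop would increment the ECX index until cpuid4_cache_type returns 0.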


EAX         Information Provided about the Processor                      Return value 

80000006H   EAX  Reserved                                                 0 
            EBX  Reserved                                                 0 
            ECX  Bits 7-0: L2 cache Line size in bytes                    64 
                 Bits 15-12: L2 associativity field                       06H 
                 Bits 31-16: L2 cache size in 1K units                    256 
            EDX  Reserved                                                 0 

            Reserved 
80000007H   EAX  Reserved                                                 0 
            EBX  Reserved                                                 0 
            ECX  Reserved                                                 0 
            EDX  Reserved                                                 0 

            Virtual/Physical Address size 
80000008H   EAX  Bits 7-0: #Physical Address Bits                         40 
                 Bits 15-8: #Virtual Address Bits                         48 
            EBX  Reserved                                                 0 
            ECX  Reserved                                                 0 
            EDX  Reserved                                                 0 


Table B.7: Information Returned by CPUID Instruction. 8000000xH leafs. (Contd.) 
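

The packed fields of leaves 80000006H and 80000008H can be unpacked as follows. This is a minimal sketch with hypothetical names that uses only the bit ranges listed in Table B.7; the register values are passed in rather than read with CPUID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the fields of CPUID.80000006H:ECX. */
typedef struct {
    unsigned line_bytes;   /* ECX[7:0]   */
    unsigned assoc_field;  /* ECX[15:12] */
    unsigned size_kb;      /* ECX[31:16] */
} cpuid_l2_info;

/* Splits CPUID.80000006H:ECX into its L2 cache fields. */
static cpuid_l2_info cpuid_decode_l2(uint32_t ecx)
{
    cpuid_l2_info info;
    info.line_bytes  = ecx & 0xFFu;
    info.assoc_field = (ecx >> 12) & 0xFu;
    info.size_kb     = (ecx >> 16) & 0xFFFFu;
    return info;
}

/* Physical address width, CPUID.80000008H:EAX[7:0]. */
static unsigned cpuid_phys_addr_bits(uint32_t eax)
{
    return eax & 0xFFu;
}

/* Virtual address width, CPUID.80000008H:EAX[15:8]. */
static unsigned cpuid_virt_addr_bits(uint32_t eax)
{
    return (eax >> 8) & 0xFFu;
}
```

Packing the Table B.7 values gives ECX = 01006040H for leaf 80000006H and EAX = 00003028H for leaf 80000008H, which decode back to 64, 06H, 256 and to 40 and 48 respectively.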


Operation 


IA32_BIOS_SIGN_ID MSR = Update with installed microcode revision number; 


case (EAX) 
{ 
    EAX == 0: 
        EAX = 01H; // Highest basic function CPUID input value 
        EBX = "Genu"; 
        ECX = "ntel"; 
        EDX = "ineI"; 
        break; 
    EAX == 2H: 
        // Cache and TLB information 
        EAX = 0; 
        EBX = 0; 
        ECX = 0; 
        EDX = 0; 
        break; 
    EAX == 3H: 
        // PSN features 
        EAX = 0; 
        EBX = 0; 
        ECX = 0; 
        EDX = 0; 
        break; 
    EAX == 4H: 
        // Deterministic Cache Parameters Leaf 
        EAX = *see table* 
        EBX = *see table* 
        ECX = *see table* 
        EDX = *see table* 
        break; 
    EAX == 20000000H: 
        EAX = 01H; // Reserved 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 0; // Reserved 
        break; 
    EAX == 20000001H: 
        EAX = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 00000010H; // Reserved 
        break; 
    EAX == 80000000H: 
        // Extended leaf 
        EAX = 80000008H; // Highest extended function CPUID input value 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 0; // Reserved 
        break; 
    EAX == 80000001H: 
        EAX = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX[0] = 1; // LAHF/SAHF support in 64 bit mode 
        ECX[31:1] = 0; // Reserved 
        EDX[10:0] = 0; // Reserved 
        EDX[11] = 1; // SYSCALL/SYSRET available in 64 bit mode 
        EDX[19:12] = 0; // Reserved 
        EDX[20] = 0; // Execute Disable Bit available 
        EDX[28:21] = 0; // Reserved 
        EDX[29] = 1; // Intel(R) 64 Technology available 
        EDX[31:30] = 0; // Reserved 
        break; 
    EAX == 80000002H: 
        EAX = 0; // Processor Brand String 
        EBX = 0; // Processor Brand String Continued 
        ECX = 0; // Processor Brand String Continued 
        EDX = 0; // Processor Brand String Continued 
        break; 
    EAX == 80000003H: 
        EAX = 0; // Processor Brand String Continued 
        EBX = 0; // Processor Brand String Continued 
        ECX = 0; // Processor Brand String Continued 
        EDX = 0; // Processor Brand String Continued 
        break; 
    EAX == 80000004H: 
        EAX = 0; // Processor Brand String Continued 
        EBX = 0; // Processor Brand String Continued 
        ECX = 0; // Processor Brand String Continued 
        EDX = 0; // Processor Brand String Continued 
        break; 
    EAX == 80000005H: 
        EAX = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 0; // Reserved 
        break; 
    EAX == 80000006H: 
        EAX = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX[7:0] = 64; // L2 cache Line size in bytes 
        ECX[15:12] = 6; // L2 associativity field (8-way) 
        ECX[31:16] = 256; // L2 cache size in 1K units 
        EDX = 0; // Reserved 
        break; 
    EAX == 80000007H: 
        EAX = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 0; // Reserved 
        break; 
    EAX == 80000008H: 
        EAX[7:0] = 40; // Physical Address bits 
        EAX[15:8] = 48; // Virtual Address bits 
        EAX[31:16] = 0; // Reserved 
        EBX = 0; // Reserved 
        ECX = 0; // Reserved 
        EDX = 0; // Reserved 
        break; 
    default, EAX == 1H: 
        EAX[3:0] = Stepping ID; 
        EAX[7:4] = *see table* // Model 
        EAX[11:8] = 1011B; // Family 
        EAX[13:12] = 00B; // Processor type 
        EAX[15:14] = 00B; // Reserved 
        EAX[19:16] = 0000B; // Extended Model 
        EAX[27:20] = 00000000B; // Extended Family 
        EAX[31:28] = 0H; // Reserved 
        EBX[7:0] = 00H; // Brand Index (* Reserved if the value is zero *) 
        EBX[15:8] = 8; // CLEVICT1/CLFLUSH Line Size (x8) 
        EBX[23:16] = 248; // Maximum number of logical processors 
        EBX[31:24] = Initial Apic ID; 
        ECX = 00000000H; // Feature flags 
        EDX = 110193FFH; // Feature flags 
        break; 
} 


Flags Affected 


None. 


Intel® C/C++ Compiler Intrinsic Equivalent 


None 


Exceptions 


None. 




Bit #  Mnemonic   Description | Return Value 

0      FPU        Floating-point Unit On-Chip. The processor contains an x87 FPU. | 1 
1      VME        Virtual 8086 Mode Enhancements. Virtual 8086 mode enhancements, | 1 
                  including CR4.VME for controlling the feature, CR4.PVI for protected 
                  mode virtual interrupts, software interrupt indirection, expansion of 
                  the TSS with the software indirection bitmap, and EFLAGS.VIF and 
                  EFLAGS.VIP flags. 
2      DE         Debugging Extensions. Support for I/O breakpoints, including CR4.DE | 1 
                  for controlling the feature, and optional trapping of accesses to DR4 
                  and DR5. 
3      PSE        Page Size Extension. Large pages of size 4 MByte are supported, | 1 
                  including CR4.PSE for controlling the feature, the defined dirty bit in 
                  PDE (Page Directory Entries), optional reserved bit trapping in CR3, 
                  PDEs, and PTEs. 
4      TSC        Time Stamp Counter. The RDTSC instruction is supported, including | 1 
                  CR4.TSD for controlling privilege. 
5      MSR        Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR | 1 
                  and WRMSR instructions are supported. Some of the MSRs are 
                  implementation dependent. 
6      PAE        Physical Address Extension. Physical addresses greater than 32 bits | 1 
                  are supported: extended page table entry formats, an extra level in the 
                  page translation tables is defined, 2-MByte pages are supported instead 
                  of 4 Mbyte pages if PAE bit is 1. The actual number of address bits 
                  beyond 32 is not defined, and is implementation specific. 
7      MCE        Machine Check Exception. Exception 18 is defined for Machine Checks, | 1 
                  including CR4.MCE for controlling the feature. This feature does not 
                  define the model-specific implementations of machine-check error 
                  logging, reporting, and processor shutdowns. Machine Check exception 
                  handlers may have to depend on processor version to do model specific 
                  processing of the exception, or test for the presence of the Machine 
                  Check feature. 
8      CX8        CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) | 1 
                  instruction is supported (implicitly locked and atomic). 
9      APIC       APIC On-Chip. The processor contains an Advanced Programmable | ? 
                  Interrupt Controller (APIC), responding to memory mapped commands in 
                  the physical address range FFFE0000H to FFFE0FFFH (by default - some 
                  processors permit the APIC to be relocated). 
10     Reserved   Reserved | 0 
11     SEP        SYSENTER and SYSEXIT Instructions. The SYSENTER and SYSEXIT | 0 
                  instructions and associated MSRs are supported. 
12     MTRR       Memory Type Range Registers. MTRRs are supported. The MTRRcap MSR | 1 
                  contains feature bits that describe what memory types are supported, 
                  how many variable MTRRs are supported, and whether fixed MTRRs are 
                  supported. 
13     PGE        PTE Global Bit. The global bit in page directory entries (PDEs) and | 0 
                  page table entries (PTEs) is supported, indicating TLB entries that are 
                  common to different processes and need not be flushed. The CR4.PGE bit 
                  controls this feature. 
14     MCA        Machine Check Architecture. The Machine Check Architecture, which | 0 
                  provides a compatible mechanism for error reporting in P6 family, 
                  Pentium® 4, Intel® Xeon® processors, and future processors, is 
                  supported. The MCG_CAP MSR contains feature bits describing how many 
                  banks of error reporting MSRs are supported. 


Table B.8: Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX) 


Bit #  Mnemonic      Description | Return Value 

15     CMOV          Conditional Move Instructions. The conditional move instruction | 0 
                     CMOV is supported. In addition, if x87 FPU is present as indicated 
                     by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions 
                     are supported. 
16     PAT           Page Attribute Table. Page Attribute Table is supported. This 
                     feature augments the Memory Type Range Registers (MTRRs), allowing 
                     an operating system to specify attributes of memory on a 4K 
                     granularity through a linear address. 
17     PSE-36        36-Bit Page Size Extension. Extended 4-MByte pages that are capable 
                     of addressing physical memory beyond 4 GBytes are supported. This 
                     feature indicates that the upper four bits of the physical address 
                     of the 4-MByte page is encoded by bits 13-16 of the page directory 
                     entry. 
18     PSN           Processor Serial Number. The processor supports the 96-bit 
                     processor identification number feature and the feature is enabled. 
19     CLFSH         CLFLUSH Instruction. CLFLUSH Instruction is supported. 
20     Reserved      Reserved | 0 
21     DS            Debug Store. The processor supports the ability to write debug 
                     information into a memory resident buffer. This feature is used by 
                     the branch trace store (BTS) and precise event-based sampling 
                     (PEBS) facilities (see Chapter 15, Debugging and Performance 
                     Monitoring, in the IA-32 Intel® Architecture Software Developer's 
                     Manual, Volume 3). 
22     ACPI          Thermal Monitor and Software Controlled Clock Facilities. The 
                     processor implements internal MSRs that allow processor temperature 
                     to be monitored and processor performance to be modulated in 
                     predefined duty cycles under software control. 
23     MMX           Intel® MMX™ Technology. The processor supports the Intel® MMX™ 
                     technology. 
24     FXSR          FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR 
                     instructions are supported for fast save and restore of the 
                     floating-point context. Presence of this bit also indicates that 
                     CR4.OSFXSR is available for an operating system to indicate that it 
                     supports the FXSAVE and FXRSTOR instructions. 
25     Intel® SSE    Intel® SSE. The processor supports the Intel® SSE extensions. 
26     Intel® SSE2   Intel® SSE2. The processor supports the Intel® SSE2 extensions. | 0 
27     SS            Self Snoop. The processor supports the management of conflicting 
                     memory types by performing a snoop of its own cache structure for 
                     transactions issued to the bus. 
28     HTT           Multi-Threading. The physical processor package is capable of 
                     supporting more than one logical processor. 
29     TM            Thermal Monitor. The processor implements the thermal monitor 
                     automatic thermal control circuitry (TCC). 
30     Reserved      Reserved 
31     PBE           Pending Break Enable. The processor supports the use of the 
                     FERR#/PBE# pin when the processor is in the stop-clock state 
                     (STPCLK# is asserted) to signal the processor that an interrupt is 
                     pending and that the processor should return to normal operation to 
                     handle the interrupt. Bit 10 (PBE enable) in the IA32_MISC_ENABLE 
                     MSR enables this capability. 


Table B.9: Feature Information Returned in the EDX Register (CPUID.EAX[01h].EDX) (Contd.) 
Bit #   Mnemonic              Description                                                        Return Value

0       Intel® SSE3           Streaming SIMD Extensions 3 (SSE3). A value of 1 indicates         0
                              that the processor supports this technology.

1-2     Reserved              Reserved                                                           0

3       MONITOR               MONITOR/MWAIT. A value of 1 indicates that the processor           0
                              supports this feature.

4       DS-CPL                CPL Qualified Debug Store. A value of 1 indicates that the         0
                              processor supports the extensions to the Debug Store feature
                              to allow for branch message storage qualified by CPL.

5       VMX                   Virtual Machine Extensions. A value of 1 indicates that the        0
                              processor supports this technology.

6       Reserved              Reserved                                                           0

7       EST                   Enhanced Intel® SpeedStep® technology. A value of 1 indicates      0
                              that the processor supports this technology.

8       TM2                   Thermal Monitor 2. A value of 1 indicates that the processor       0
                              supports this technology.

9       SSSE3                 Supplemental Streaming SIMD Extensions 3 (SSSE3). A value of 1     0
                              indicates that the processor supports this technology.

10      CNXT-ID               L1 Context ID. A value of 1 indicates that the L1 data cache       0
                              mode can be set to either adaptive mode or shared mode. A
                              value of 0 indicates that this feature is not supported. See
                              the definition of the IA32_MISC_ENABLE MSR Bit 24 (L1 Data
                              Cache Context Mode) for details.

11-12   Reserved              Reserved                                                           0

13      CMPXCHG16B            CMPXCHG16B Available. A value of 1 indicates that the feature      0
                              is available. See the CMPXCHG8B/CMPXCHG16B-Compare and
                              Exchange Bytes section in Volume 2A.

14      xTPR Update Control   xTPR Update Control. A value of 1 indicates that the processor     0
                              supports changing IA32_MISC_ENABLES[bit 23].

15      PDCM                  Perf/Debug Capability MSR. A value of 1 indicates that the         0
                              processor supports the performance and debug feature
                              indication MSR.

18-16   Reserved              Reserved                                                           0

19      Intel® SSE4.1         Intel® Streaming SIMD Extensions 4.1 (Intel® SSE4.1). A value      0
                              of 1 indicates that the processor supports this technology.

20      Intel® SSE4.2         Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2). A value      0
                              of 1 indicates that the processor supports this technology.

22-21   Reserved              Reserved                                                           0

23      POPCNT                POPCNT. A value of 1 indicates that the processor supports the     0*
                              POPCNT instruction.

31-24   Reserved              Reserved                                                           0


Table B.10: Feature Information Returned in the ECX Register (CPUID.EAX[01h].ECX) 


*CPUID bit 23 erroneously indicates that POPCNT is not supported. Knights Corner does support the POPCNT instruction. See Appendix A 
for more information. 
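
The bit assignments in Table B.10 can be exercised with a short decoder. The following Python sketch is illustrative only: the function name decode_ecx and the is_knights_corner parameter are hypothetical, and the ECX value is assumed to have been obtained elsewhere (for example, from a CPUID dump).

```python
# Bit positions taken from Table B.10 (CPUID.EAX[01h].ECX); reserved bits are omitted.
ECX_FEATURE_BITS = {
    0: "SSE3", 3: "MONITOR", 4: "DS-CPL", 5: "VMX", 7: "EST", 8: "TM2",
    9: "SSSE3", 10: "CNXT-ID", 13: "CMPXCHG16B", 14: "xTPR Update Control",
    15: "PDCM", 19: "SSE4.1", 20: "SSE4.2", 23: "POPCNT",
}


def decode_ecx(ecx, is_knights_corner=False):
    """Return the sorted feature mnemonics whose bit is set in ecx.

    is_knights_corner is a hypothetical flag: when set, POPCNT is reported
    as supported even though bit 23 reads 0, as the table footnote states.
    """
    features = {name for bit, name in ECX_FEATURE_BITS.items() if (ecx >> bit) & 1}
    if is_knights_corner:
        features.add("POPCNT")
    return sorted(features)
```

With every bit returning 0, as listed for Knights Corner, decode_ecx(0, is_knights_corner=True) reports POPCNT as the only feature.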
Appendix C


Floating-Point Exception Summary 


C.1 Instruction floating-point exception summary 


Table C.3 lists every instruction that can generate a floating-point exception and shows, per instruction,
each type of exception it can raise. Each table entry contains one of the following symbols:


Nothing: An exception of that type cannot be produced by the instruction.


Yboth: The instruction can produce that exception. The exception may be produced by either the operation
or the data-type conversion applied to the memory operand.


Yconv: The instruction can produce that exception. The exception can only be produced by the data-type
conversion applied to the memory operand.


Yoper: The instruction can produce that exception. The exception can only be produced by the operation.
The data-type conversion applied to the memory operand cannot produce any exception.


Instruction #I #D #Z #O #U #P
vaddpd Yboth Yoper Yoper Yoper Yoper
vaddps Yboth Yoper Yoper Yoper Yoper
vaddnpd Yboth Yoper Yoper Yoper Yoper
vaddnps Yboth Yoper Yoper Yoper Yoper
vaddsetsps Yboth Yoper Yoper Yoper Yoper
vblendmps Yconv
vbroadcastf32x4 Yconv
vbroadcastss Yconv
vcmppd Yboth Yoper
vcmpps Yboth Yoper
vcvtpd2ps Yboth Yoper Yoper Yoper Yoper
vcvtps2pd Yboth Yoper
vcvtfxpntdq2ps Yoper
vcvtfxpntpd2dq Yboth Yoper
vcvtfxpntpd2udq Yboth Yoper
vcvtfxpntps2dq Yboth Yoper
vcvtfxpntps2udq Yboth Yoper
vcvtfxpntudq2ps Yoper
vexp223ps Yoper
vfixupnanpd Yboth
vfixupnanps Yboth
vfmadd132pd Yboth Yoper Yoper Yoper Yoper
vfmadd132ps Yboth Yoper Yoper Yoper Yoper
vfmadd213pd Yboth Yoper Yoper Yoper Yoper
vfmadd213ps Yboth Yoper Yoper Yoper Yoper
vfmadd231pd Yboth Yoper Yoper Yoper Yoper
vfmadd231ps Yboth Yoper Yoper Yoper Yoper
vfmadd233ps Yboth Yoper Yoper Yoper Yoper
vfmsub132pd Yboth Yoper Yoper Yoper Yoper
vfmsub132ps Yboth Yoper Yoper Yoper Yoper
vfmsub213pd Yboth Yoper Yoper Yoper Yoper
vfmsub213ps Yboth Yoper Yoper Yoper Yoper
vfmsub231pd Yboth Yoper Yoper Yoper Yoper
vfmsub231ps Yboth Yoper Yoper Yoper Yoper
vfnmadd132pd Yboth Yoper Yoper Yoper Yoper
vfnmadd132ps Yboth Yoper Yoper Yoper Yoper
vfnmadd213pd Yboth Yoper Yoper Yoper Yoper
vfnmadd213ps Yboth Yoper Yoper Yoper Yoper
vfnmadd231pd Yboth Yoper Yoper Yoper Yoper
vfnmadd231ps Yboth Yoper Yoper Yoper Yoper
vfnmsub132pd Yboth Yoper Yoper Yoper Yoper
vfnmsub132ps Yboth Yoper Yoper Yoper Yoper
vfnmsub213pd Yboth Yoper Yoper Yoper Yoper
vfnmsub213ps Yboth Yoper Yoper Yoper Yoper
vfnmsub231pd Yboth Yoper Yoper Yoper Yoper
vfnmsub231ps Yboth Yoper Yoper Yoper Yoper
vgatherdps Yconv
vgetexppd Yboth Yoper
vgetexpps Yboth Yoper
vgetmantpd Yboth Yoper
vgetmantps Yboth Yoper
vgmaxpd Yboth Yoper
vgmaxps Yboth Yoper
vgmaxabsps Yboth Yoper
vgminpd Yboth Yoper
vgminps Yboth Yoper
vloadunpackhps Yconv
vloadunpacklps Yconv
vlog2ps Yboth Yoper
vmovaps (load) Yconv
vmovaps (store) Yconv Yconv Yconv Yconv Yconv
vmulpd Yboth Yoper Yoper Yoper Yoper
vmulps Yboth Yoper Yoper Yoper Yoper
vpackstorehps Yconv Yconv Yconv Yconv Yconv
vpackstorelps Yconv Yconv Yconv Yconv Yconv
vrcp23ps Yboth Yoper
vrndfxpntpd Yboth Yoper
vrndfxpntps Yboth Yoper
vrsqrt23ps Yboth Yoper
vscaleps Yoper Yoper Yoper Yoper Yoper
vscatterdps Yconv Yconv Yconv Yconv Yconv
vsubpd Yboth Yoper Yoper Yoper Yoper
vsubps Yboth Yoper Yoper Yoper Yoper
vsubrpd Yboth Yoper Yoper Yoper Yoper
vsubrps Yboth Yoper Yoper Yoper Yoper


C.2 Conversion floating-point exception summary 


Float-to-float

Float16 to float32    SwizzUpConv/UpConv    Invalid (on SNaN)
Float32 to float64    VCVTPS2PD             Invalid (on SNaN), Denormal
Float32 to float16    DownConv              Invalid (on SNaN), Overflow, Underflow, Precision, Denormal
Float64 to float32    VCVTPD2PS             Invalid (on SNaN), Overflow, Underflow, Precision, Denormal

Integer-to-float

Uint8/16 to float32   UpConv                None
Sint8/16 to float32   UpConv                None
Uint32 to float32     VCVTFXPNTUDQ2PS       Precision
Sint32 to float32     VCVTFXPNTDQ2PS        Precision
Uint32 to float64     VCVTUDQ2PD            None
Sint32 to float64     VCVTDQ2PD             None

Float-to-integer

Float32 to uint8/16   DownConv              Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)
Float32 to sint8/16   DownConv              Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)
Float32 to uint32     VCVTFXPNTPS2UDQ       Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)
Float32 to sint32     VCVTFXPNTPS2DQ        Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)
Float64 to uint32     VCVTFXPNTPD2UDQ       Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)
Float64 to sint32     VCVTFXPNTPD2DQ        Invalid (on NaN, out-of-range), Precision (if in-range but input not integer)

Out-of-range values depend on the operation definition and the rounding mode. Table C.3 and Table C.4 give the
maximum and minimum allowed values for float-to-integer and float-to-float conversions, respectively. Note that
the ranges shown are evaluated after "Denormals Are Zero" (DAZ) has been applied.


Entries in Table C.4 marked with an asterisk (*) are not required for Knights Corner.
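
The float-to-integer rules above can be illustrated with a small model of a float32 to uint8 down-conversion under round-to-nearest-even. This is a sketch under stated assumptions, not the hardware definition: the function names are hypothetical, the in-range limits for RN are taken to be -0.5 and 255.5 - 1ulp, DAZ is not modeled, and the value returned on an Invalid condition is left as None because this section does not define it.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def downconv_f32_to_uint8_rn(x):
    """Model float32 -> uint8 conversion with round-to-nearest-even.

    Returns (result, flags). flags is a set holding "Invalid" (NaN or
    out-of-range input) or "Precision" (in-range, non-integer input).
    """
    x = to_f32(x)
    if math.isnan(x) or math.isinf(x):
        return None, {"Invalid"}
    rounded = round(x)  # Python rounds halfway cases to the even integer
    if not 0 <= rounded <= 255:
        return None, {"Invalid"}
    flags = {"Precision"} if rounded != x else set()
    return rounded, flags
```

Under this model 255.5 rounds to 256 and is reported as Invalid, while -0.5 rounds to 0 and raises only Precision, which matches the assumed RN limits.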


C.3 Denormal behavior


Instruction Treat Input Denormals As Zeros Flush Tiny Results To Zero
vaddpd MXCSR.DAZ MXCSR.FZ 
vaddps MXCSR.DAZ MXCSR.FZ 
vaddnpd MXCSR.DAZ MXCSR.FZ 
vaddnps MXCSR.DAZ MXCSR.FZ 
vaddsetsps MXCSR.DAZ MXCSR.FZ 
vblendmpd NO NO 

vblendmps NO NO 

vcmppd MXCSR.DAZ Not Applicable
vcmpps MXCSR.DAZ Not Applicable
vcvtdq2pd Not Applicable Not Applicable
vcvtpd2ps MXCSR.DAZ MXCSR.FZ
vcvtps2pd MXCSR.DAZ Not Applicable
vcvtudq2pd Not Applicable Not Applicable
vcvtfxpntdq2ps Not Applicable Not Applicable
vcvtfxpntpd2dq MXCSR.DAZ Not Applicable
vcvtfxpntpd2udq MXCSR.DAZ Not Applicable
vcvtfxpntps2dq MXCSR.DAZ Not Applicable
vcvtfxpntps2udq MXCSR.DAZ Not Applicable
vcvtfxpntudq2ps Not Applicable Not Applicable
vexp223ps Not Applicable YES 
vfixupnanpd MXCSR.DAZ NO 
vfixupnanps MXCSR.DAZ NO 
vfmadd132pd MXCSR.DAZ MXCSR.FZ 
vfmadd132ps MXCSR.DAZ MXCSR.FZ 
vfmadd213pd MXCSR.DAZ MXCSR.FZ 
vfmadd213ps MXCSR.DAZ MXCSR.FZ 
vfmadd231pd MXCSR.DAZ MXCSR.FZ 
vfmadd231ps MXCSR.DAZ MXCSR.FZ 
vfmadd233ps MXCSR.DAZ MXCSR.FZ 
vfmsub132pd MXCSR.DAZ MXCSR.FZ 
vfmsub132ps MXCSR.DAZ MXCSR.FZ 
vfmsub213pd MXCSR.DAZ MXCSR.FZ 
vfmsub213ps MXCSR.DAZ MXCSR.FZ 
vfmsub231pd MXCSR.DAZ MXCSR.FZ 
vfmsub231ps MXCSR.DAZ MXCSR.FZ 
vfnmadd132pd MXCSR.DAZ MXCSR.FZ 
vfnmadd132ps MXCSR.DAZ MXCSR.FZ 
vfnmadd213pd MXCSR.DAZ MXCSR.FZ 
vfnmadd213ps MXCSR.DAZ MXCSR.FZ 
vfnmadd231pd MXCSR.DAZ MXCSR.FZ 
vfnmadd231ps MXCSR.DAZ MXCSR.FZ 
vfnmsub132pd MXCSR.DAZ MXCSR.FZ 
vfnmsub132ps MXCSR.DAZ MXCSR.FZ 
vfnmsub213pd MXCSR.DAZ MXCSR.FZ 
vfnmsub213ps MXCSR.DAZ MXCSR.FZ 
vfnmsub231pd MXCSR.DAZ MXCSR.FZ 
vfnmsub231ps MXCSR.DAZ MXCSR.FZ 
vgatherdpd NO NO 

vgatherdps NO NO 
vgatherpf0dps NO NO 
vgatherpf0hintdpd NO NO
vgatherpf0hintdps NO NO
vgatherpf1dps NO NO

vgetexppd MXCSR.DAZ Not Applicable 
vgetexpps MXCSR.DAZ Not Applicable 
vgetmantpd MXCSR.DAZ Not Applicable 
vgetmantps MXCSR.DAZ Not Applicable 
vgmaxpd MXCSR.DAZ NO 

vgmaxps MXCSR.DAZ NO 
vgmaxabsps MXCSR.DAZ NO 

vgminpd MXCSR.DAZ NO 

vgminps MXCSR.DAZ NO 
vloadunpackhpd NO NO 
vloadunpackhps NO NO 
vloadunpacklpd NO NO 
vloadunpacklps NO NO 

vlog2ps YES YES 

vmovapd (load) NO NO 

vmovapd (store) NO (DAZ*) NO 

vmovaps (load) NO NO 

vmovaps (store) NO (DAZ*) NO 
vmovnrapd (load) NO NO 
vmovnrapd (store) NO (DAZ*) NO 
vmovnraps (load) NO NO 
vmovnraps (store) NO (DAZ*) NO 
vmovnrngoapd (load) NO NO
vmovnrngoapd (store) NO (DAZ*) NO
vmovnrngoaps (load) NO NO
vmovnrngoaps (store) NO (DAZ*) NO

vmulpd MXCSR.DAZ MXCSR.FZ 
vmulps MXCSR.DAZ MXCSR.FZ 
vpackstorehpd NO (DAZ*) NO 
vpackstorehps NO (DAZ*) NO 
vpackstorelpd NO (DAZ*) NO 
vpackstorelps NO (DAZ*) NO 
vrcp23ps YES YES 
vrndfxpntpd MXCSR.DAZ NO 
vrndfxpntps MXCSR.DAZ NO 
vrsqrt23ps YES YES 
vscaleps MXCSR.DAZ MXCSR.FZ 
vscatterdpd NO (DAZ*) NO 
vscatterdps NO (DAZ*) NO 
vscatterpf0dps NO NO 
vscatterpf0hintdpd NO NO
vscatterpf0hintdps NO NO
vscatterpf1dps NO NO
vsubpd MXCSR.DAZ MXCSR.FZ 
vsubps MXCSR.DAZ MXCSR.FZ 
vsubrpd MXCSR.DAZ MXCSR.FZ 
vsubrps MXCSR.DAZ MXCSR.FZ 


(*) FP32 down-conversion obeys MXCSR.DAZ 
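
The two columns of the denormal table can be illustrated with a simplified software model of a single-precision add, such as vaddps, whose row lists MXCSR.DAZ and MXCSR.FZ. The sketch below is an assumption-laden illustration: the helper names and the mxcsr_daz and mxcsr_fz parameters are hypothetical stand-ins for the MXCSR control bits, and the model treats "tiny" as "the rounded float32 result is denormal", which may differ in corner cases from the hardware definition.

```python
import math
import struct

F32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal float32 value


def to_f32(x):
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def is_denormal_f32(x):
    """True for a nonzero value below the float32 normal range."""
    return x != 0.0 and abs(x) < F32_MIN_NORMAL


def flush(x, enabled):
    """Replace a denormal with a zero of the same sign when enabled."""
    if enabled and is_denormal_f32(x):
        return math.copysign(0.0, x)
    return x


def add_f32(a, b, mxcsr_daz=False, mxcsr_fz=False):
    """Model a float32 add: DAZ applies to the inputs, FZ to the result."""
    a = flush(to_f32(a), mxcsr_daz)
    b = flush(to_f32(b), mxcsr_daz)
    return flush(to_f32(a + b), mxcsr_fz)
```

For example, adding the smallest denormal 2^-149 to itself yields 2^-148 by default and 0.0 when the DAZ model is enabled.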
Conversion          Context           Rounding   Max                                        Min
Float32 to uint8    DownConv          RN         0x437f7fff (255.5 - 1ulp)                  0xbf000000 (-0.5)
Float32 to sint8    DownConv          RN         0x42feffff (127.5 - 1ulp)                  0xc3008000 (-128.5)
Float32 to uint16   DownConv          RN         0x477fff7f (65535.5 - 1ulp)                0xbf000000 (-0.5)
Float32 to sint16   DownConv          RN         0x46fffeff (32767.5 - 1ulp)                0xc7000080 (-32768.5)
Float32 to uint32   VCVTFXPNTPS2UDQ   RN         0x4f7fffff (2^32 - 1ulp)                   0xbf000000 (-0.5)
                                      RD         0x4f7fffff (2^32 - 1ulp)                   0x80000000 (-0.0)
                                      RU         0x4f7fffff (2^32 - 1ulp)                   0xbf7fffff (-1.0 + 1ulp)
                                      RZ         0x4f7fffff (2^32 - 1ulp)                   0xbf7fffff (-1.0 + 1ulp)
Float32 to sint32   VCVTFXPNTPS2DQ    RN         0x4effffff (2^31 - 1ulp)                   0xcf000000 (-2^31)
                                      RD         0x4effffff (2^31 - 1ulp)                   0xcf000000 (-2^31)
                                      RU         0x4effffff (2^31 - 1ulp)                   0xcf000000 (-2^31)
                                      RZ         0x4effffff (2^31 - 1ulp)                   0xcf000000 (-2^31)
Float64 to uint32   VCVTFXPNTPD2UDQ   RN         0x41efffffffefffff (2^32 - 0.5 - 1ulp)     0xbfe0000000000000 (-0.5)
                                      RD         0x41efffffffffffff (2^32 - 1ulp)           0x8000000000000000 (-0.0)
                                      RU         0x41efffffffe00000 (2^32 - 1.0)            0xbfefffffffffffff (-1.0 + 1ulp)
                                      RZ         0x41efffffffffffff (2^32 - 1ulp)           0xbfefffffffffffff (-1.0 + 1ulp)
Float64 to sint32   VCVTFXPNTPD2DQ    RN         0x41dfffffffdfffff (2^31 - 0.5 - 1ulp)     0xc1e0000000100000 (-2^31 - 0.5)
                                      RD         0x41dfffffffffffff (2^31 - 1ulp)           0xc1e0000000000000 (-2^31)
                                      RU         0x41dfffffffc00000 (2^31 - 1.0)            0xc1e00000001fffff (-2^31 - 1.0 + 1ulp)
                                      RZ         0x41dfffffffffffff (2^31 - 1ulp)           0xc1e00000001fffff (-2^31 - 1.0 + 1ulp)
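
Range limits written as a value minus or plus one ulp can be converted to bit patterns by stepping the IEEE 754 encoding by one. The following Python sketch shows the arithmetic for a float32 limit of 255.5 - 1ulp and a float64 limit of 2^31 - 0.5 - 1ulp; the helper names are hypothetical, and the stepping trick is only valid for finite, nonzero values.

```python
import struct


def f32_bits(x):
    """Return the 32-bit IEEE 754 encoding of x rounded to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def f32_from_bits(bits):
    """Return the float32 value encoded by a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f64_bits(x):
    """Return the 64-bit IEEE 754 encoding of a binary64 value."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def f32_step_toward_zero(x):
    """Return the float32 neighbor of x that is one ulp closer to zero.

    Decrementing the encoding reduces the magnitude by one ulp for any
    finite, nonzero float32, regardless of sign.
    """
    return f32_from_bits(f32_bits(x) - 1)


limit = f32_step_toward_zero(255.5)           # 255.5 - 1ulp
print(hex(f32_bits(limit)))                   # bit pattern of the limit
print(hex(f64_bits(2.0 ** 31 - 0.5) - 1))     # 2^31 - 0.5 - 1ulp in float64
```

Between 128 and 256 one float32 ulp is 2^-16, so the first limit equals 255.5 - 2^-16.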
Case                 Rounding   Max pos arg w/o overflow                     Min pos arg w/ overflow
Float32 to float16   RN         0x477fefff (65520.0 - 1ulp)                  0x477ff000 (65520.0)
                     RD*        0x477fffff (65536.0 - 1ulp)                  0x47800000 (65536.0)
                     RU*        0x477fe000 (65504.0)                         0x477fe001 (65504.0 + 1ulp)
                     RZ         0x477fffff (65536.0 - 1ulp)                  0x47800000 (65536.0)
Float64 to float32   RN         0x47efffffefffffff (2^128 - 2^103 - 1ulp)    0x47effffff0000000 (2^128 - 2^103)
                     RD         0x47efffffffffffff (2^128 - 1ulp)            0x47f0000000000000 (2^128)
                     RU         0x47efffffe0000000 (2^128 - 2^104)           0x47efffffe0000001 (2^128 - 2^104 + 1ulp)
                     RZ         0x47efffffffffffff (2^128 - 1ulp)            0x47f0000000000000 (2^128)

Case                 Rounding   Max neg arg w/o overflow                     Min neg arg w/ overflow
Float32 to float16   RN         0xc77fefff (-65520.0 + 1ulp)                 0xc77ff000 (-65520.0)
                     RD*        0xc77fe000 (-65504.0)                        0xc77fe001 (-65504.0 - 1ulp)
                     RU*        0xc77fffff (-65536.0 + 1ulp)                 0xc7800000 (-65536.0)
                     RZ         0xc77fffff (-65536.0 + 1ulp)                 0xc7800000 (-65536.0)
Float64 to float32   RN         0xc7efffffefffffff (-2^128 + 2^103 + 1ulp)   0xc7effffff0000000 (-2^128 + 2^103)
                     RD         0xc7efffffe0000000 (-2^128 + 2^104)          0xc7efffffe0000001 (-2^128 + 2^104 - 1ulp)
                     RU         0xc7efffffffffffff (-2^128 + 1ulp)           0xc7f0000000000000 (-2^128)
                     RZ         0xc7efffffffffffff (-2^128 + 1ulp)           0xc7f0000000000000 (-2^128)


Table C.4: Float-to-float Max/Min Valid Range
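
The round-to-nearest (RN) thresholds for the float32 to float16 case can be checked in software. The sketch below relies on Python's struct module, whose "e" format packs a value as IEEE 754 binary16 using round-half-to-even and raises OverflowError when the rounded result does not fit; it is offered as an independent cross-check under that assumption, not as a model of the DownConv hardware, and the helper names are hypothetical.

```python
import struct


def f32_from_bits(bits):
    """Return the float32 value encoded by a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def overflows_f16_rn(x):
    """True if rounding x to float16 (nearest-even) overflows."""
    try:
        struct.pack("<e", x)
    except OverflowError:
        return True
    return False


# 0x477fefff is 65520.0 - 1ulp and still rounds to 65504.0, the largest float16.
largest_ok = f32_from_bits(0x477FEFFF)
# 0x477ff000 is exactly 65520.0, the halfway point, which rounds up and overflows.
smallest_bad = f32_from_bits(0x477FF000)

print(overflows_f16_rn(largest_ok), overflows_f16_rn(smallest_bad))
```

The same check applies to the negative RN limits 0xc77fefff and 0xc77ff000, because the sign does not affect round-to-nearest.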
Appendix D


Instruction Attributes and Categories 


This appendix enumerates instruction attributes and categories.


D.1 Conversion Instruction Families 


D.1.1 Df32 Family of Instructions


VMOVAPS VMOVNRAPS VMOVNRNGOAPS  VPACKSTOREHPS 
VPACKSTORELPS VSCATTERDPS VSCATTERPF1DPS 


D.1.2 Df64 Family of Instructions


VMOVAPD VMOVNRAPD VMOVNRNGOAPD VPACKSTOREHPD
VPACKSTORELPD VSCATTERDPD


D.1.3 Di32 Family of Instructions


VMOVDQA32 VPACKSTOREHD VPACKSTORELD VPSCATTERDD 


D.1.4 Di64 Family of Instructions


VMOVDQA64 VPACKSTOREHQ VPACKSTORELQ VPSCATTERDQ 


D.1.5 Sf32 Family of Instructions


VADDNPS VADDPS VADDSETSPS VBLENDMPS 
VCMPPS VCVTFXPNTPS2DQ VCVTFXPNTPS2UDQ VCVTPS2PD 
VFMADD132PS VFMADD213PS VFMADD231PS VFMADD233PS 
VFMSUB132PS VFMSUB213PS VFMSUB231PS VFNMADD132PS 
VFNMADD213PS VFNMADD231PS VFNMSUB132PS VFNMSUB213PS 
VFNMSUB231PS VGETEXPPS VGETMANTPS VGMAXABSPS 
VGMAXPS VGMINPS VMULPS VRNDFXPNTPS 
VSUBPS VSUBRPS 


D.1.6 Sf64 Family of Instructions


VADDNPD VADDPD VBLENDMPD VCMPPD 
VCVTFXPNTPD2DQ VCVTFXPNTPD2UDQ VCVTPD2PS VFMADD132PD 
VFMADD213PD VFMADD231PD VFMSUB132PD VFMSUB213PD 
VFMSUB231PD VFNMADD132PD VFNMADD213PD VFNMADD231PD 
VFNMSUB132PD VFNMSUB213PD VFNMSUB231PD VGETEXPPD 
VGETMANTPD VGMAXPD VGMINPD VMULPD 
VRNDFXPNTPD VSUBPD VSUBRPD 
D.1.7 Si32 Family of Instructions


VCVTDQ2PD VCVTFXPNTDQ2PS VCVTFXPNTUDQ2PS VCVTUDQ2PD
VFIXUPNANPS VPADCD VPADDD VPADDSETCD
VPADDSETSD VPANDD VPANDND VPBLENDMD 
VPCMPD VPCMPEQD VPCMPGTD VPCMPLTD 
VPCMPUD VPMADD231D VPMADD233D VPMAXSD 
VPMAXUD VPMINSD VPMINUD VPMULHD 
VPMULHUD VPMULLD VPORD VPSBBD 
VPSBBRD VPSLLD VPSLLVD VPSRAD 
VPSRAVD VPSRLD VPSRLVD VPSUBD 
VPSUBRD VPSUBRSETBD VPSUBSETBD VPTESTMD 
VPXORD VSCALEPS 


D.1.8 Si64 Family of Instructions


VFIXUPNANPD VPANDNQ VPANDQ VPBLENDMQ 
VPORQ VPXORQ 


D.1.9 Uf32 Family of Instructions


VBROADCASTF32X4 VBROADCASTSS VGATHERDPS VGATHERPF0DPS
VGATHERPF0HINTDPS VGATHERPF1DPS VLOADUNPACKHPS VLOADUNPACKLPS
VMOVAPS VMOVNRAPS VMOVNRNGOAPS VSCATTERPF0DPS
VSCATTERPF0HINTDPS


D.1.10 Uf64 Family of Instructions


VBROADCASTF64X4 VBROADCASTSD VGATHERDPD VGATHERPF0HINTDPD
VLOADUNPACKHPD VLOADUNPACKLPD VMOVAPD VMOVNRAPD
VMOVNRNGOAPD VSCATTERPF0HINTDPD


D.1.11 Ui32 Family of Instructions


VBROADCASTI32X4 VLOADUNPACKHD VLOADUNPACKLD VMOVDQA32 
VPBROADCASTD VPGATHERDD 


D.1.12 Ui64 Family of Instructions


VBROADCASTI64X4 VLOADUNPACKHQ VLOADUNPACKLQ VMOVDQA64 
VPBROADCASTQ VPGATHERDQ 
Appendix E


Non-faulting Undefined Opcodes 


The following opcodes are non-faulting and have undefined behavior: 




MVEX.512.0F38.W0 D2 /r 
MVEX.512.0F38.W0 D3 /r 
MVEX.512.0F38.W0 D6 /r 
MVEX.512.0F38.W0 D7 /r 
MVEX.512.66.0F38.W0 48 /r 
MVEX.512.66.0F38.W0 49 /r 
MVEX.512.66.0F38.W0 4A /r 
MVEX.512.66.0F38.W0 4B /r 
MVEX.512.66.0F38.W0 68 /r 
MVEX.512.66.0F38.W0 69 /r 
MVEX.512.66.0F38.W0 6A /r 
MVEX.512.66.0F38.W0 6B /r 
MVEX.512.66.0F38.W0 B0 /r /vsib
MVEX.512.66.0F38.W0 B2 /r /vsib
MVEX.512.66.0F38.W0 C0 /r /vsib
MVEX.512.66.0F38.W0 D2 /r 
MVEX.512.66.0F38.W0 D6 /r 
MVEX.512.66.0F3A.W0 D0 /r ib
MVEX.512.66.0F3A.W0 D1 /r ib 
MVEX.NDS.512.66.0F38.W0 54 /r 
MVEX.NDS.512.66.0F38.W0 56 /r
MVEX.NDS.512.66.0F38.W0 57 /r 
MVEX.NDS.512.66.0F38.W0 67 /r 
MVEX.NDS.512.66.0F38.W0 70 /r 
MVEX.NDS.512.66.0F38.W0 71 /r 
MVEX.NDS.512.66.0F38.W0 72 /r 
MVEX.NDS.512.66.0F38.W0 73 /r 
MVEX.NDS.512.66.0F38.W0 94 /r 
MVEX.NDS.512.66.0F38.W0 CE /r 
MVEX.NDS.512.66.0F38.W0 CF /r 
MVEX.NDS.512.66.0F38.W1 94 /r
MVEX.NDS.512.66.0F38.W1 CE /r
VEX.128.F2.0F38.W0 F0 /r
VEX.128.F2.0F38.W0 F1 /r
VEX.128.F2.0F38.W1 F0 /r
VEX.128.F2.0F38.W1 F1 /r
VEX.128.F3.0F38.W0 F0 /r
VEX.128.F3.0F38.W1 F0 /r
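
The opcode strings in this list follow a dotted notation that can be split mechanically. The Python sketch below is a hypothetical helper, not part of any Intel tool; it assumes the usual reading of the notation, in which 128 or 512 is the vector length, 66/F2/F3 is the SIMD prefix, 0F/0F38/0F3A is the opcode map, W0/W1 is the W bit, and the tokens after the opcode byte (/r, /vsib, ib) describe the remaining bytes.

```python
def parse_opcode(text):
    """Split an opcode string such as 'MVEX.NDS.512.66.0F38.W0 54 /r'."""
    prefix, opcode, *operands = text.split()
    fields = prefix.split(".")
    info = {
        "encoding": fields[0],        # MVEX or VEX
        "nds": "NDS" in fields[1:],   # non-destructive source marker
        "length": None,
        "simd_prefix": None,
        "map": None,
        "w": None,
        "opcode": int(opcode, 16),    # opcode byte, written in hexadecimal
        "operands": operands,         # e.g. ['/r'] or ['/r', 'ib']
    }
    for field in fields[1:]:
        if field in ("128", "512"):
            info["length"] = int(field)
        elif field in ("66", "F2", "F3"):
            info["simd_prefix"] = field
        elif field in ("0F", "0F38", "0F3A"):
            info["map"] = field
        elif field in ("W0", "W1"):
            info["w"] = int(field[1])
    return info


print(parse_opcode("MVEX.NDS.512.66.0F38.W0 54 /r"))
```

A parser of this kind is convenient for building a lookup set against which a disassembler can flag the non-faulting undefined encodings.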
Appendix F


General Templates


This appendix describes all the general templates. Each instruction has at least one valid format, and each
format matches one of these templates.




F.1 Mask Operation Templates 


Mask m0 - Template


Opcode    Instruction   Description
VEX.128   KOP k1, k2    Operate [mask k1 and] mask k2 [and store the result in k1]

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX2         1      1      1      1      1      0      p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (K1) | r (K2)
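
The Mask m0 layout can be assembled byte by byte. The following Python sketch is illustrative only: the function name encode_mask_m0 is hypothetical, and the opcode value 0x41 used in the example is a placeholder chosen for illustration rather than a value taken from this template.

```python
def encode_mask_m0(opcode, k1, k2, pp=0):
    """Assemble the 4-byte Mask m0 form: C5, VEX byte, opcode, ModR/M.

    k1 goes in ModR/M.reg and k2 in ModR/M.r/m; pp is the 2-bit field
    shown as p1 p0 in the template.
    """
    if not (0 <= k1 <= 7 and 0 <= k2 <= 7):
        raise ValueError("mask registers are k0..k7")
    if not 0 <= pp <= 3:
        raise ValueError("pp is a 2-bit field")
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode is a single byte")
    vex = 0b11111000 | pp                     # 1 | 1111 | L=0 | p1 p0
    modrm = 0b11000000 | (k1 << 3) | k2       # mod=11 | reg=k1 | r/m=k2
    return bytes([0xC5, vex, opcode, modrm])


# Placeholder opcode 0x41 with k1 and k2 as the operands.
print(encode_mask_m0(0x41, 1, 2).hex())
```

Because the register form fixes mod to 11, the whole instruction is always four bytes in this template.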
Mask m1 - Template


Opcode    Instruction             Description
VEX.128   KOP r32/r64, k1, imm8   Move mask k1 into r32/r64 using imm8

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         !reg3  1      1      m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (reg) | r (K1)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
Mask m2 - Template


Opcode    Instruction   Description

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         !reg3  1      1      m4     m3     m2     m1     m0
VEX2         W      1      !K1_2  !K1_1  !K1_0  L=0    p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (reg) | r (K2)
Mask m3 - Template


Opcode    Instruction       Description
VEX.128   KOP r32/r64, k1   Move mask k1 into r32/r64

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX1         !reg3  1      1      1      1      0      p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (reg) | r (K1)
Mask m4 - Template


Opcode    Instruction       Description
VEX.128   KOP k1, r32/r64   Move r32/r64 into mask k1

Description


Operand is a register


C4 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         1      1      !reg3  m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      0      p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (K1) | r (reg)

C5 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX1         1      1      1      1      1      0      p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (K1) | r (reg)
Mask m5 - Template


Opcode    Instruction             Description
VEX.128   KOP k1, r32/r64, imm8   Move r32/r64 field into mask k1 using imm8

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         1      1      !reg3  m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (K1) | r (reg)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
Vector v0 - Template


Opcode     Instruction                                Description
MVEX.512   VOP zmm1 {k1}, zmm2, S(zmm3/mt)            Operate vector zmm2 and vector S(zmm3/mt) [and vector zmm1]
                                                      and store the result in zmm1, under write-mask k1
MVEX.512   VOP zmm1 {k1}, zmm2, S(zmm3/mt), imm8      Operate vector zmm2 and vector S(zmm3/mt) [and vector zmm1]
                                                      and store the result in zmm1 using imm8, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !Z3_4  !Z3_3  !Z1_4  m3     m2     m1     m0
MVEX2        W      !Z2_3  !Z2_2  !Z2_1  !Z2_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z2_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (ZMM1) | r (ZMM3)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0


Operand is a memory location


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      !Z2_3  !Z2_2  !Z2_1  !Z2_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z2_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
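
The register form of the Vector v0 layout can be assembled as follows. This Python sketch is illustrative: the function name and its parameters are hypothetical, the mmmm and pp field values are assumed to follow the usual VEX convention (0F = 0001, 0F38 = 0010, 0F3A = 0011; no prefix = 00, 66 = 01, F3 = 10, F2 = 11), and the opcode 0x58 in the example is a placeholder. Fields marked with "!" in the template are stored inverted.

```python
OPCODE_MAP = {"0F": 0b0001, "0F38": 0b0010, "0F3A": 0b0011}   # assumed mmmm values
SIMD_PREFIX = {None: 0b00, "66": 0b01, "F3": 0b10, "F2": 0b11}  # assumed pp values


def inv_bit(reg, bit):
    """Return the inverted value of one bit of a register index."""
    return ((reg >> bit) & 1) ^ 1


def encode_v0_reg(opcode, zmm1, zmm2, zmm3, k1=0, w=0, eh=0, sss=0,
                  opmap="0F", prefix=None):
    """Assemble the register form of Vector v0 (no IMM8): 62 MVEX1-3 opcode ModR/M."""
    for reg in (zmm1, zmm2, zmm3):
        if not 0 <= reg <= 31:
            raise ValueError("vector registers are zmm0..zmm31")
    if not (0 <= k1 <= 7 and 0 <= sss <= 7 and w in (0, 1) and eh in (0, 1)):
        raise ValueError("field out of range")
    mvex1 = ((inv_bit(zmm1, 3) << 7) | (inv_bit(zmm3, 4) << 6) |
             (inv_bit(zmm3, 3) << 5) | (inv_bit(zmm1, 4) << 4) | OPCODE_MAP[opmap])
    mvex2 = (w << 7) | ((~zmm2 & 0xF) << 3) | SIMD_PREFIX[prefix]  # bit 2 is L=0
    mvex3 = (eh << 7) | (sss << 4) | (inv_bit(zmm2, 4) << 3) | k1
    modrm = 0xC0 | ((zmm1 & 7) << 3) | (zmm3 & 7)
    return bytes([0x62, mvex1, mvex2, mvex3, opcode, modrm])


# Placeholder opcode 0x58 with zmm1 {k1}, zmm2, zmm3 as operands.
print(encode_v0_reg(0x58, 1, 2, 3, k1=1).hex())
```

Keeping the inversion in one helper makes it easy to see which register bits land in which prefix byte.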
Vector v1 - Template


Opcode     Instruction             Description
MVEX.512   VOP zmm1 {k1}, S(mt)    Load/broadcast vector S(mt) into zmm1, under write-mask k1

Description


Operand is a memory location


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v10 - Template


Opcode     Instruction                        Description
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt)          Operate vector S(zmm2/mt) and store the result in zmm1,
                                              under write-mask k1
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt), imm8    Operate vector S(zmm2/mt) and store the result in zmm1
                                              using imm8, under write-mask k1
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt)          Move vector S(zmm2/mt) into zmm1, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !Z2_4  !Z2_3  1      m3     m2     m1     m0
MVEX2        W      !Z1_3  !Z1_2  !Z1_1  !Z1_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z1_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | Op. Ext. | r (ZMM2)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !X     !B     1      m3     m2     m1     m0
MVEX2        W      !Z1_3  !Z1_2  !Z1_1  !Z1_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z1_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | Op. Ext. | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v11 - Template


Opcode     Instruction                   Description
MVEX.512   VOP zmm1 {k1}, zmm2, S(mt)    Load/broadcast and OP vector S(mt) with zmm2 and write the
                                         result into zmm1, under write-mask k1

Description

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      !Z2_3  !Z2_2  !Z2_1  !Z2_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z2_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v2 - Template


Opcode     Instruction                             Description
MVEX.512   VOP k2 {k1}, zmm2, S(zmm3/mt)           Operate vector zmm2 and vector S(zmm3/mt) and store the
                                                   result in k2, under write-mask k1
MVEX.512   VOP k2 {k1}, zmm2, S(zmm3/mt), imm8     Operate vector zmm2 and vector S(zmm3/mt) and store the
                                                   result in k2 using imm8, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !Z2_4  !Z2_3  1      m3     m2     m1     m0
MVEX2        W      !Z1_3  !Z1_2  !Z1_1  !Z1_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z1_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (K2) | r (ZMM2)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !X     !B     1      m3     m2     m1     m0
MVEX2        W      !Z1_3  !Z1_2  !Z1_1  !Z1_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     !Z1_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (K2) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
Vector v3 - Template


Opcode     Instruction              Description
MVEX.512   VOP mt {k1}, D(zmm1)     Store vector D(zmm1) into mt, under write-mask k1

Description

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v4 - Template


Opcode     Instruction                     Description
MVEX.512   VOP zmm1 {k1}, zmm2/mt          Operate vector zmm2/mt and store the result in zmm1,
                                           under write-mask k1
MVEX.512   VOP zmm1 {k1}, zmm2/mt, imm8    Operate vector zmm2/mt and store the result in zmm1
                                           using imm8, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !Z2_4  !Z2_3  !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     0      0      0      1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (ZMM1) | r (ZMM2)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0


Operand is a memory location


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     0      0      0      1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0
Vector v5 - Template


Opcode     Instruction                        Description
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt)          Operate vector S(zmm2/mt) and store the result in zmm1,
                                              under write-mask k1
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt), imm8    Operate vector S(zmm2/mt) and store the result in zmm1
                                              using imm8, under write-mask k1
MVEX.512   VOP zmm1 {k1}, S(zmm2/mt)          Move vector S(zmm2/mt) into zmm1, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !Z2_4  !Z2_3  !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (ZMM1) | r (ZMM2)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0


Operand is a memory location


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v6 - Template


Opcode     Instruction               Description
MVEX.512   VOP zmm1 {k1}, S(mvt)     Gather sparse vector S(mvt) into zmm1, using completion mask k1
MVEX.512   VOP mvt {k1}, D(zmm1)     Scatter vector D(zmm1) into sparse vector mvt, using completion mask k1

Description


Operand is a memory location


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X3    !B3    !Z1_4  m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     !X4    K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m = 100
VSIB         SS1 SS0 | Index (X) | Base (B)
{DISPL}      Displacement (8*N/32)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0


Vector v7 - Template


Opcode     Instruction                        Description
MVEX.512   VOP zmm1 {k1}, k2, S(zmm3/mt)      Operate mask k2 and vector S(zmm3/mt) [and vector zmm1],
                                              and store the result in zmm1, under write-mask k1

Description


Operand is a register


Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !Z3_4  !Z3_3  !Z1_4  m3     m2     m1     m0
MVEX2        W      1      !K2_2  !K2_1  !K2_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (ZMM1) | r (ZMM3)

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      1      !K2_2  !K2_1  !K2_0  L=0    p1     p0
MVEX3        EH     S2     S1     S0     1      K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v8 - Template


Opcode     Instruction                           Description
MVEX.512   VOP zmm1 {k1}, zmm2, zmm3/mt          Operate vector zmm2 and vector zmm3/mt [and vector zmm1]
                                                 and store the result in zmm1, under write-mask k1
MVEX.512   VOP zmm1 {k1}, zmm2, zmm3/mt, imm8    Operate vector zmm2 and vector zmm3/mt [and vector zmm1]
                                                 and store the result in zmm1 using imm8, under write-mask k1

Description

Operand is a register

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !Z3_4  !Z3_3  !Z1_4  m3     m2     m1     m0
MVEX2        W      !Z2_3  !Z2_2  !Z2_1  !Z2_0  L=0    p1     p0
MVEX3        EH     0      0      0      !Z2_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       11 | reg (ZMM1) | r (ZMM3)
{IMM8}       I7     I6     I5     I4     I3     I2     I1     I0

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        !Z1_3  !X     !B     !Z1_4  m3     m2     m1     m0
MVEX2        W      !Z2_3  !Z2_2  !Z2_1  !Z2_0  L=0    p1     p0
MVEX3        EH     0      0      0      !Z2_4  K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | reg (ZMM1) | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8*N/32)
Vector v9 - Template


Opcode     Instruction         Description
MVEX.512   VOP S(mvt) {k1}     Prefetch sparse vector S(mvt), under write-mask k1

Description

Operand is a memory location

Bit          7      6      5      4      3      2      1      0
ESCAPE(62)   0      1      1      0      0      0      1      0
MVEX1        1      !X3    !B3    1      m3     m2     m1     m0
MVEX2        W      1      1      1      1      L=0    p1     p0
MVEX3        EH     S2     S1     S0     !X4    K1_2   K1_1   K1_0
OPCODE       OPCODE
ModR/M       mod | Op. Ext. | m = 100
VSIB         SS1 SS0 | Index (X) | Base (B)
{DISPL}      Displacement (8*N/32)
Scalar s0 - Template


Opcode               Instruction        Description
0F/0F38/0F3A         OP r16, r16/m16    Operate [r16 and] r16/m16, leaving the result in r16
0F/0F38/0F3A         OP r32, r32/m32    Operate [r32 and] r32/m32, leaving the result in r32
REX.W 0F/0F38/0F3A   OP r64, r64/m64    Operate [r64 and] r64/m64, leaving the result in r64

Description


Operand is a register


C4 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         !dst3  1      !src3  m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (dst) | r (src)

C5 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX2         !dst3  1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       11 | reg (dst) | r (src)
Scalar s1 - Template


Opcode    Instruction   Description
VEX.128   OP mt         Prefetch/Evict mt memory location

Description


Operand is a memory location


C4 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C4)   1      1      0      0      0      1      0      0
VEX1         1      !X     !B     m4     m3     m2     m1     m0
VEX2         W      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       mod | Op. Ext. | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8/32)

C5 Version
Bit          7      6      5      4      3      2      1      0
ESCAPE(C5)   1      1      0      0      0      1      0      1
VEX2         1      1      1      1      1      L=0    p1     p0
OPCODE       OPCODE
ModR/M       mod | Op. Ext. | m (mt)
{SIB}        SIB byte
{DISPL}      Displacement (8/32)
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